{
 "cells": [
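  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to HTML Parsing\n",
    "\n",
    "*  Main topic this week: parsing HTML and practicing XPath\n",
    "*  20春_Web数据挖掘_week02 (Spring 2020, Web Data Mining, week 02)\n",
    "*  Original designers of this e-handout: 廖汉腾, 许智超\n",
    "*  Exercises adapted by: 汤雨晴\n",
    "  * Peer-review workshop for this week's e-handout. Do the following in order:\n",
    "     * Download this document from e.nfu.edu.cn\n",
    "     * Working locally, replace the 123456789 in the ipynb file name with your student ID\n",
    "     * <mark>Before making any changes, upload the ipynb file whose name ends in your student ID to GitHub as the first version</mark> (other files do not count)\n",
    "     * Working locally, practice all of the content and add to or trim this handout as needed, including code, markdown, and new data (with links to it)\n",
    "     * Minimum to pass: <mark>**a \"weekly summary\" written in markdown that summarizes the lecture and this e-handout's supplementary content in 150-500 characters**</mark>; you may use in-document HTML hyperlinks to link to other notes in the same document\n",
    "     * During peer review you will be asked to submit a diff of your own document, so that classmates can see the scope and content of your changes\n",
    "  * This week's bonus items go to whoever is fastest, and <mark>each person may claim at most 1 item</mark>. You must mine data from the assigned url and export it to Excel. Speed is judged <mark>first</mark> by the code's commit time on GitHub; if two commit times are less than 3 minutes apart, it is <mark>then</mark> judged by the time you @ the teacher in the QQ group\n",
    "      * C-3: +1 to the final total score\n",
    "      * C-4: +2 to the final total score\n",
    "      * C-5: +5 to the final total score\n",
    "-----\n",
    "![for humans](https://requests-html.kennethreitz.org/_static/requests-html-logo.png)\n",
    "\n",
    "## Review\n",
    "\n",
    "Review of last week: an overview of using\n",
    "\n",
    "* requests-html,\n",
    "* pd.read_html, and\n",
    "* requests + lxml\n",
    "\n",
    "for Web data mining, which consists mainly of the following steps, in order:\n",
    "\n",
    "1. Send a request over HTTP (HTTP request)\n",
    "2. Check that the HTTP status code and the HTTP response are normal\n",
    "3. Parse the HTML, usually with XPath or CSS selectors\n",
    "\n",
    "<br/>\n",
    "<br/>\n",
    "\n",
    "-----\n",
    "![Xpath Axis](http://krum.rz.uni-mannheim.de/inet-2005/images/xpath-axis.gif)\n",
    "\n",
    "\n",
    "## This Week's Content and Learning Goals\n",
    "\n",
    "This week focuses on step 3.\n",
    "We pick web pages that are relatively easy to mine (i.e. with few of the pitfalls of steps 1 and 2 above) and learn to handle the following challenges:\n",
    "\n",
    "1. Use requests-html to fetch a web page and save it as a text file, consulting the [requests-html Chinese documentation](https://cncert.github.io/requests-html-doc-cn/#/)\n",
    "2. Get familiar with [XPath syntax](https://www.w3cschool.cn/xpath/xpath-syntax.html) and [XPath nodes](https://www.w3cschool.cn/xpath/xpath-nodes.html)\n",
    "3. Use the [xpath cheatsheet](https://devhints.io/xpath)\n",
    "  * in Chrome Inspector\n",
    "  * in requests-html (Python)\n",
    "4. Basic use of [pd.DataFrame]()\n",
    "\n",
    "Students will practice\n",
    "* parsing simple HTML pages\n",
    "* using XPath (unselective, greedy strategies vs. selective, non-greedy ones)\n",
    "* extracting tags, attributes, and values"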
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "/* CSS used by this e-handout */\n",
       "div.code_cell {\n",
       "    background-color: #e5f1fe;\n",
       "}\n",
       "div.cell.selected {\n",
       "    background-color: #effee2;\n",
       "    font-size: 2rem;\n",
       "    line-height: 2.4rem;\n",
       "}\n",
       "div.cell.selected .rendered_html table {\n",
       "    font-size: 2rem !important;\n",
       "    line-height: 2.4rem !important;\n",
       "}\n",
       ".rendered_html pre code {\n",
       "    background-color: #C4E4ff;   \n",
       "    padding: 2px 25px;\n",
       "}\n",
       ".rendered_html pre {\n",
       "    background-color: #99c9ff;\n",
       "}\n",
       "div.code_cell .CodeMirror {\n",
       "    font-size: 2rem !important;\n",
       "    line-height: 2.4rem !important;\n",
       "}\n",
       ".rendered_html img, .rendered_html svg {\n",
       "    max-width: 35%;\n",
       "    height: auto;\n",
       "    float: right;\n",
       "}\n",
       "/* Gradient transparent - color - transparent */\n",
       "hr {\n",
       "    border: 0;\n",
       "    border-bottom: 1px dashed #ccc;\n",
       "}\n",
       ".emoticon{\n",
       "    font-size: 5rem;\n",
       "    line-height: 4.4rem;\n",
       "    text-align: center;\n",
       "    vertical-align: middle;\n",
       "}\n",
       "</style>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<style>\n",
    "/* CSS used by this e-handout */\n",
    "div.code_cell {\n",
    "    background-color: #e5f1fe;\n",
    "}\n",
    "div.cell.selected {\n",
    "    background-color: #effee2;\n",
    "    font-size: 2rem;\n",
    "    line-height: 2.4rem;\n",
    "}\n",
    "div.cell.selected .rendered_html table {\n",
    "    font-size: 2rem !important;\n",
    "    line-height: 2.4rem !important;\n",
    "}\n",
    ".rendered_html pre code {\n",
    "    background-color: #C4E4ff;   \n",
    "    padding: 2px 25px;\n",
    "}\n",
    ".rendered_html pre {\n",
    "    background-color: #99c9ff;\n",
    "}\n",
    "div.code_cell .CodeMirror {\n",
    "    font-size: 2rem !important;\n",
    "    line-height: 2.4rem !important;\n",
    "}\n",
    ".rendered_html img, .rendered_html svg {\n",
    "    max-width: 35%;\n",
    "    height: auto;\n",
    "    float: right;\n",
    "}\n",
    "/* Gradient transparent - color - transparent */\n",
    "hr {\n",
    "    border: 0;\n",
    "    border-bottom: 1px dashed #ccc;\n",
    "}\n",
    ".emoticon{\n",
    "    font-size: 5rem;\n",
    "    line-height: 4.4rem;\n",
    "    text-align: center;\n",
    "    vertical-align: middle;\n",
    "}\n",
    "</style>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Core modules\n",
    "import pandas as pd\n",
    "from requests_html import HTMLSession"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# requests-html\n",
    "Students will practice\n",
    "* parsing simple HTML pages\n",
    "\n",
    "Use requests-html to fetch a web page and save it as a text file, consulting the [requests-html Chinese documentation](https://cncert.github.io/requests-html-doc-cn/#/)\n",
    "\n",
    "* API documentation\n",
    "  * HTML class\n",
    "  * Element class\n",
    "  * HTML Sessions (more accurately, HTTP Sessions)  \n",
    "* [Original documentation](https://requests-html.kennethreitz.org//_modules/requests_html.html)\n",
    "\n",
    "Key point: how HTTP and HTML divide the work and cooperate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The HTML class\n",
    "\n",
    "Basic use of the HTML text, and saving it for later use"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A1  Search nfu.edu.cn for 文学与传媒学院 and keep the response for later use\n",
    "payload = {\n",
    "    \"keyword\":\"文学与传媒学院\",\n",
    "    \"p\":\"1\"\n",
    "}\n",
    "\n",
    "session = HTMLSession()\n",
    "r = session.get(\"http://www.nfu.edu.cn/index.php/home/article/search.html\", params=payload)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\ufeff<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\\r\\n<html>\\r\\n<head>\\r\\n<meta name=\"renderer\" content=\"webkit\">\\r\\n<meta http-equiv=\"x-ua-compatible\" content=\"IE=edge\" >\\r\\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\\r\\n<title>-中山大学南方学院 </title>\\r\\n<meta name=\"keywords\" content=\"\">\\r\\n<meta name=\"description\" content=\"\">\\r\\n\\t\\n<link rel=\"stylesheet\" type=\"text/css\" href=\"/Public/Home/css/swiper-3.3.1.min.css\"/>\\n\\t\\t<link href=\"/Public/Home/css/lin.css\" rel=\"stylesheet\" type=\"text/css\" />\\n\\n\\t\\t<script src=\"/Public/Home/js/jquery-1.11.3.min.js\"></script>\\n\\t\\t<script src=\"/Public/Home/js/jquery-1.11.1.js\"></script>\\n\\t\\t<script src=\"/Public/Home/js/jquery.easie-min.js\" type=\"text/javascript\"></script>\\n\\t\\t<script src=\"/Public/Home/js/swiper.min.js\" type=\"text/javascript\"></script>\\n\\t\\t<script src=\"/Public/Home/js/lin.js\"></script>\\n\\t\\t\\n\\t\\t\\n<link href=\"/Public/Home/page.css\" rel=\"stylesheet\" type=\"text/css\" />\\n<link href=\"/Public/favicon.ico\" rel=\"Shortcut Icon\">\\n<link href=\"/Public/favicon.ico\" rel=\"Bookmark\">\\n\\r\\n\\t</head>\\r\\n<body>\\r\\n\\ufeff<!--头部-->\\n\\t\\t<div class=\"lin-header \">\\n\\t\\t\\t<div class=\"lin-head clearfix\">\\n\\t\\t\\t\\t<h1 class=\"lin-topl\"><a href=\"/index.php\" target=\"_blank\" title=\"中山大学南方学院\"><img src=\"/Public/Home/images/logo.png\"/></a></h1>\\n\\t\\t\\t\\t<div class=\"lin-topr\">\\n\\t\\t\\t\\t\\t<div class=\"lin-youxiang\">\\n\\t\\t\\t\\t\\t\\t<a href=\"http://oa.nfu.edu.cn/\" target=\"_blank\">办公系统</a>\\n\\t\\t\\t\\t\\t\\t<a href=\"http://en.nfu.edu.cn/\">English Version</a>\\n\\t\\t\\t\\t\\t\\t<!-- <a href=\"https://mail.nfu.edu.cn/\" target=\"_blank\">邮箱登录</a>\\n\\t\\t\\t\\t\\t\\t<a href=\"mailto:nfcsysuyz@126.com\" target=\"_blank\" title=\"nfcsysuyz@126.com\" >院长信箱</a> 
-->\\n\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t<div class=\"lin-ser lin-serhide\">\\n\\t\\t\\t\\t\\t\\t<div class=\"serbox\">\\n\\t\\t\\t\\t\\t\\t<form action=\"/index.php/home/article/search.html\" method=\"get\" id=\"search_form\">\\n\\t\\t\\t\\t\\t\\t\\t<input type=\"text\" name=\"keyword\" id=\"keyword\" placeholder=\"搜索\" />\\n\\t\\t\\t\\t\\t\\t\\t<a href=\"javascript:;\" id=\"search_btn\" ></a>\\n\\t\\t\\t\\t\\t\\t</form>\\t\\n\\t\\t\\t\\t\\t\\t<script type=\"text/javascript\">\\n\\t\\t\\t\\t\\t\\t\\t$(\"#search_btn\").click(function(){\\n\\t\\t\\t\\t\\t\\t\\t\\tvar keyword=$(\"#keyword\").val();\\n\\t\\t\\t\\t\\t\\t\\t\\tif(keyword==\\'\\'){\\n\\t\\t\\t\\t\\t\\t\\t\\t\\talert(\\'* 请输入搜索关键词 !\\');\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t$(\"#keyword\").focus();\\n\\t\\t\\t\\t\\t\\t\\t\\t\\treturn false;\\n\\t\\t\\t\\t\\t\\t\\t\\t}else{\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t$(\"#search_form\").submit();\\n\\t\\t\\t\\t\\t\\t\\t\\t}\\n\\t\\t\\t\\t\\t\\t\\t})\\n\\t\\t\\t\\t\\t\\t</script>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t\\t<!-- <span class=\"ser-biaoti\"><a href=\\'\\' style=\"color:#fff;\">English Version</a></span> -->\\n\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t</div>\\n\\t\\t\\t</div>\\n\\t\\t</div>\\n\\t\\t<!-- end 头部-->\\n\\t\\t<!--导航条-->\\n\\t\\t<div class=\"lin-navbar\">\\n\\t\\t\\t<p class=\"navnav\">\\n\\t\\t\\t\\t<span></span>\\n\\t\\t\\t\\t<span></span>\\n\\t\\t\\t\\t<span></span>\\n\\t\\t\\t</p>\\n\\t\\t\\t<ul class=\"lin-nav clearfix\">\\n\\t\\t\\t\\t<li  class=\"lin-navli\"><a href=\"/index.php\">首页</a>\\n\\t\\t\\t\\t</li>\\n\\t\\t\\t\\t<li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/29.html\"  target=\"_blank\">学校概况</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a 
href=\"/index.php/home/article/index/cid/29.html\" target=\"_self\">学校简介</a></li><li><a href=\"/index.php/home/article/index/cid/30.html\" target=\"_blank\">现任领导</a></li><li><a href=\"/index.php/home/article/index/cid/135.html\" target=\"_self\">校徽  校训  校歌</a></li><li><a href=\"/index.php/home/article/index/cid/34.html\" target=\"_blank\">南方大事记</a></li><li><a href=\"/index.php/home/article/index/cid/104.html\" target=\"_self\">学校校历</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/2.html\"  target=\"_self\">党建之窗</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/61.html\"  target=\"_blank\">机构设置</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/61.html\" target=\"_self\">院系设置</a></li><li><a href=\"/index.php/home/article/index/cid/36.html\" target=\"_self\">管理机构</a></li><li><a href=\"/index.php/home/article/index/cid/165.html\" target=\"_self\">常设委员会</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/31.html\"  target=\"_blank\">人才培养</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul 
clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/31.html\" target=\"_blank\">名师介绍</a></li><li><a href=\"/index.php/home/article/index/cid/163.html\" target=\"_self\">本科教育</a></li><li><a href=\"/index.php/home/article/index/cid/164.html\" target=\"_self\">继续教育</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/106.html\"  target=\"_blank\">教学科研</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/106.html\" target=\"_blank\">教务与科研部</a></li><li><a href=\"/index.php/home/article/index/cid/127.html\" target=\"_blank\">科研信息与动态</a></li><li><a href=\"/index.php/home/article/index/cid/107.html\" target=\"_blank\">科研机构</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/49.html\"  target=\"_blank\">招生就业</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/49.html\" target=\"_blank\">本科招生</a></li><li><a href=\"/index.php/home/article/index/cid/129.html\" target=\"_self\">继续教育</a></li><li><a href=\"/index.php/home/article/index/cid/50.html\" target=\"_blank\">就业服务</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li 
class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/79.html\"  target=\"_blank\">图书馆</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/79.html\" target=\"_blank\">图书馆</a></li><li><a href=\"/index.php/home/article/index/cid/80.html\" target=\"_blank\">档案室</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/159.html\"  target=\"_blank\">合作交流</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/159.html\" target=\"_blank\">国际交流</a></li><li><a href=\"/index.php/home/article/index/cid/161.html\" target=\"_blank\">外事服务</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/44.html\"  target=\"_blank\">人才招聘</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/44.html\" target=\"_blank\">教师系列</a></li><li><a href=\"/index.php/home/article/index/cid/45.html\" 
target=\"_blank\">管理系列</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li><li class=\"lin-navli\"><a href=\"/index.php/home/article/index/cid/32.html\"  target=\"_blank\">走进南方</a>\\n\\n\\t\\t\\t\\t\\t<!-- <i f condition=\"!empty($nav[\\'son_list\\']) and $nav[id] !=3 and  $nav[id] !=4 and $nav[id] !=5 and $nav[id] !=89\"> -->\\n\\t\\t\\t\\t\\t<div class=\"lin-navdiv\">\\n\\t\\t\\t\\t\\t\\t<div class=\"sonnav-bg\">\\n\\t\\t\\t\\t\\t\\t\\t<ul class=\"nav-conul clearfix\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<li><a href=\"/index.php/home/article/index/cid/32.html\" target=\"_blank\">图说南方</a></li><li><a href=\"/index.php/home/article/index/cid/105.html\" target=\"_self\">生活服务</a></li><li><a href=\"/index.php/home/article/index/cid/87.html\" target=\"_self\">医疗服务</a></li><li><a href=\"/index.php/home/article/index/cid/51.html\" target=\"_blank\">校报</a></li><li><a href=\"/index.php/home/article/index/cid/82.html\" target=\"_self\">交通指引</a></li>\\t\\t\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t</div>\\t\\t\\t\\t</li>\\n\\t\\t\\t</ul>\\n\\t\\t\\t\\n\\t\\t</div>\\n\\t\\t<div class=\"lin-navbg\"></div>\\n\\n\\r\\n<div class=\"lin-content\">\\r\\n\\t\\t\\t<div class=\"lin-neiye clearfix\">\\r\\n\\t\\t\\t\\t\\r\\n\\r\\n\\t\\t\\t    <div class=\"search_list_right\">\\r\\n\\t\\t\\t        <div class=\"fan clearfix\">\\r\\n\\t\\t\\t            <span class=\"fan_title\">站内搜索</span>\\r\\n\\t\\t\\t            <span class=\"fan_right\">您当前位置是：<a href=\"/index.php\">网站首页</a> &gt; <font>站内搜索</font></span>\\r\\n\\t\\t\\t        </div>\\r\\n\\t\\t\\t        <div class=\"ny_content\">\\r\\n\\t\\t\\t\\t\\t\\t<ul class=\"list-ul\">\\r\\n\\t\\t\\t\\t\\t\\t<li><font class=\"right-more\">2020-01-06</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/6363.html\" target=\"_blank\" title=\"文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会\"><font 
color=red>文学与传媒学院</font>教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2020-01-06</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/6366.html\" target=\"_blank\" title=\"文学与传媒学院2019年学术研讨会暨总结大会顺利召开\"><font color=red>文学与传媒学院</font>2019年学术研讨会暨总结大会顺利召开</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-12-20</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/6318.html\" target=\"_blank\" title=\"展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕\">展现当代青年的迷惘与奋进——我校<font color=red>文学与传媒学院</font>大型原创舞台剧《春至》圆满落幕</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-11-22</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/6154.html\" target=\"_blank\" title=\"文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束\"><font color=red>文学与传媒学院</font>考研座谈暨2020年考研交流答疑会圆满结束</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-11-05</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/5348.html\" target=\"_blank\" title=\"文学与传媒学院教师招聘启事\"><font color=red>文学与传媒学院</font>教师招聘启事</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-11-04</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/6016.html\" target=\"_blank\" title=\"创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行\">创意无限，未来可期——<font color=red>文学与传媒学院</font>青马工程第四讲暨闭营仪式顺利举行</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-11-04</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/6019.html\" target=\"_blank\" title=\"垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行\">垃圾分类我先行——<font color=red>文学与传媒学院</font>“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-09-16</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/5794.html\" target=\"_blank\" 
title=\"以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束\">以梦为马，不负韶华——<font color=red>文学与传媒学院</font>2019级新生开学典礼圆满结束</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-09-09</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/5776.html\" target=\"_blank\" title=\"文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖\"><font color=red>文学与传媒学院</font>学子在全国高校数字艺术设计大赛中斩获大奖</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-09-09</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/5777.html\" target=\"_blank\" title=\"文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩\"><font color=red>文学与传媒学院</font>学子在第七届中国大学生公共关系策划大赛中喜获佳绩</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-06-24</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/5642.html\" target=\"_blank\" title=\"倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕\">倾心之作，致敬经典——<font color=red>文学与传媒学院</font>紫阳戏剧社《倾城之恋》话剧展演圆满落幕</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li><li><font class=\"right-more\">2019-06-24</font><div class=\"news_title\"><a href=\"/index.php/home/article/search_detail/id/5647.html\" target=\"_blank\" title=\"毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展\">毕业季 | 今朝有离别，青春不散场 ——<font color=red>文学与传媒学院</font>2019届毕业生毕业季系列活动有序开展</a></div>\\r\\n\\t\\t\\t\\t\\t\\t\\t</li>\\t\\t\\t\\t\\t\\t</ul>\\r\\n\\t\\t\\t\\t\\t\\t<div style=\"clear: both;\"></div>\\r\\n\\t\\t\\t\\t\\t\\t<div class=\"pages\" align=\"center\"><div>  <span class=\"current\">1</span><a class=\"num\" href=\"/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/2.html\">2</a><a class=\"num\" href=\"/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/3.html\">3</a><a class=\"num\" href=\"/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/4.html\">4</a><a class=\"num\" 
href=\"/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/5.html\">5</a><a class=\"num\" href=\"/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/6.html\">6</a><a class=\"num\" href=\"/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/7.html\">7</a> <a class=\"next\" href=\"/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/2.html\">>></a> </div></div>\\r\\n\\t\\t\\t        </div>\\r\\n\\t\\t\\t    </div>\\r\\n\\t\\t\\t</div>\\r\\n\\t\\t</div>\\r\\n\\t\\t<!-- end 内容区域-->\\r\\n\\r\\n\\t<!--底部-->\\n\\t\\t<div class=\"lin-footer\">\\n\\t\\t\\t<div class=\"lin-fer clearfix\">\\n\\t\\t\\t\\t<div class=\"ferleft\">\\n\\t\\t\\t\\t\\t<ul class=\"fer-ul clearfix\">\\n\\t\\t\\t\\t\\t\\t<li class=\"fer-li\"><a href=\"http://www.moe.gov.cn/\" target=\"_blank\" title=\"教育部\">教育部</a></li><li class=\"fer-li\"><a href=\"http://www.gz.gov.cn/\" target=\"_blank\" title=\"广州市政府\">广州市政府</a></li><li class=\"fer-li\"><a href=\"http://www.cnki.net/\" target=\"_blank\" title=\"中国知网\">中国知网</a></li><li class=\"fer-li\"><a href=\"http://edu.gd.gov.cn\" target=\"_blank\" title=\"广东省教育厅\">广东省教育厅</a></li><li class=\"fer-li\"><a href=\"http://www.gdpr.com/\" target=\"_blank\" title=\"珠江投资\">珠江投资</a></li><li class=\"fer-li\"><a href=\"http://journal.nfu.edu.cn/CN/volumn/home.shtml\" target=\"_blank\" title=\"南方论丛\">南方论丛</a></li><li class=\"fer-li\"><a href=\"http://www.sysu.edu.cn/\" target=\"_blank\" title=\"中山大学 \">中山大学 </a></li><li class=\"fer-li\"><a href=\"http://www.nfu.edu.cn/index.php/home/article/index/cid/136.html\" target=\"_blank\" title=\"珠江教育联盟\">珠江教育联盟</a></li>\\t\\n\\t\\t\\t\\t\\t\\t<li class=\"fer-li\"><a href=\"/index.php/home/article/link.html\">更多>></a></li>\\n\\t\\t\\t\\t\\t</ul>\\n\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t<div class=\"fercen\">\\n\\t\\t\\t\\t\\t<div 
class=\"fer-er\"><img src=\"/Public/Home/images/erweima1.jpg\"/></div>\\n\\t\\t\\t\\t\\t<div class=\"fer-er\"><img src=\"/Public/Home/images/erweima2.jpg\"/></div>\\n\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t<div class=\"ferright\">\\n\\t\\t\\t\\t\\t<div><p><span>地址：广州市从化区温泉大道882号中山大学南方学院</span><span>邮编：510970</span></p></div>\\n\\t\\t\\t\\t\\t<div class=\"addleft\">\\n\\t\\t\\t\\t\\t\\t<p>联系电话：020-61787326</p>\\n\\t\\t\\t\\t\\t\\t<p>版权所有 ©  中山大学南方学院</p>\\n\\t\\t\\t\\t\\t\\t<p>技术支持：<a href=\"http://www.unsun.net\" target=\"_blank\">碧辉腾乐(UNSUN.NET)</a></p>\\n\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t<div class=\"addright\">\\n\\t\\t\\t\\t\\t\\t<p>招生咨询：020-87912619</p> \\n\\t\\t\\t\\t\\t\\t<p>\\n\\t\\t\\t\\t\\t\\t\\t<span class=\"add-spante\"><a target=\"_blank\" href=\"http://www.beian.miit.gov.cn\">粤ICP备11077779号</a></span> \\n\\t\\t\\t\\t\\t\\t\\t<span class=\"add-spante\">\\n\\t\\t\\t\\t\\t\\t\\t\\t<a href=\"/index.php/admin/index/login.html\" target=\"_blank\" >网站管理</a>&nbsp;&nbsp;\\n\\t\\t\\t\\t\\t\\t\\t\\t<a href=\"http://old.nfu.edu.cn/\" target=\"_blank\" >旧站入口</a>\\n\\t\\t\\t\\t\\t\\t\\t</span>\\n\\t\\t\\t\\t\\t\\t</p>\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t</div>\\n\\t\\t\\t\\t\\n\\t\\t\\t\\t<div align=\"center\">\\n\\t\\t\\t\\t\\t\\t<a target=\"_blank\" href=\"http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=44011702000081\" style=\"display:inline-block;text-decoration:none;height:20px;line-height:20px;\"><img src=\"/Public/Home/images/icp.png\" style=\"float:left;\"/><p style=\"float:left;height:20px;line-height:20px;margin: 0px 0px 0px 5px; color:#ffffff;\">粤公网安备 44011702000081号</p></a>\\n\\t\\t\\t\\t</div>\\n\\t\\t\\t</div>\\n\\t\\t</div>\\n\\t\\t<!-- end 底部-->\\n\\n <script type=\"text/javascript\" language=\"javascript\">\\n \\n    //加入收藏\\n \\n        function AddFavorite(sURL, sTitle) {\\n \\n            sURL = encodeURI(sURL); \\n        try{   \\n \\n            
window.external.addFavorite(sURL, sTitle);   \\n \\n        }catch(e) {   \\n \\n            try{   \\n \\n                window.sidebar.addPanel(sTitle, sURL, \"\");   \\n \\n            }catch (e) {   \\n \\n                alert(\"加入收藏失败，请使用Ctrl+D进行添加,或手动在浏览器里进行设置.\");\\n \\n            }   \\n \\n        }\\n \\n    }\\n \\n    //设为首页\\n \\n    function SetHome(url){\\n \\n        if (document.all) {\\n \\n            document.body.style.behavior=\\'url(#default#homepage)\\';\\n \\n               document.body.setHomePage(url);\\n \\n        }else{\\n \\n            alert(\"您好,您的浏览器不支持自动设置页面为首页功能,请您手动在浏览器里设置该页面为首页!\");\\n \\n        }\\n \\n    }\\n \\n</script>\\n\\r\\n\\r\\n</body>\\r\\n</html>'"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#  A2  Search nfu.edu.cn for 文学与传媒学院: r.html\n",
    "# r.html is the parsed HTML object (elements/tags); r.html.html is its raw HTML text\n",
    "\n",
    "r.html.html  \n",
    "# This text can be saved to a file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A3  Save the nfu.edu.cn search page for 文学与传媒学院 for later use\n",
    "\n",
    "with open(\"20春_Web数据挖掘_week02_nfu_文学与传媒学院.html\", mode=\"w\", encoding=\"utf8\") as fp:\n",
    "    fp.write(r.html.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A4  Review: read back any text file saved earlier\n",
    "\n",
    "with open(\"20春_Web数据挖掘_week02_nfu_文学与传媒学院.html\", mode=\"r\", encoding=\"utf8\") as fp:\n",
    "    html_load = fp.read()"
   ]
  },
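  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cells A3 and A4 together form a save-then-reload round trip. The block below is a minimal sketch of that pattern, assuming a temporary directory and a short stand-in string in place of the real `r.html.html`. The names `html_text`, `path`, and `html_back` are illustrative and not part of the handout.\n",
    "\n",
    "```python\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "# Stand-in for r.html.html; any str of HTML is handled the same way\n",
    "html_text = '<html><head><title>示例</title></head><body><p>你好</p></body></html>'\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    path = Path(tmp) / 'page.html'\n",
    "    # Write with an explicit encoding so Chinese text is stored as UTF-8\n",
    "    with open(path, mode='w', encoding='utf8') as fp:\n",
    "        fp.write(html_text)\n",
    "    # Read the file back with the same encoding\n",
    "    with open(path, mode='r', encoding='utf8') as fp:\n",
    "        html_back = fp.read()\n",
    "\n",
    "# True when the reloaded text equals what was written\n",
    "print(html_back == html_text)\n",
    "```\n",
    "\n",
    "Naming the encoding on both the write and the read avoids depending on the platform default, which may not be UTF-8."
   ]
  },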
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A5  Heads up: parsing the HTML text\n",
    "\n",
    "import requests_html\n",
    "parsed = requests_html.soup_parse(html_load)\n",
    "# 解析后 (meaning parsed) is a second parse of the same text; the cells below use this name\n",
    "解析后 = requests_html.soup_parse(html_load)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parsing the Text\n",
    "![HTML head body](https://rlv.zcache.com/head_body_t_shirt-rd222a2cce3704f3b87fae4ee0fb73744_k2gm8_307.jpg)\n",
    "Parsing the HTML text\n",
    "\n",
    "```python\n",
    "parsed = requests_html.soup_parse(html_load)\n",
    "```\n",
    "\n",
    "```python\n",
    "# Option 1: import the module and call the function through it\n",
    "import requests_html\n",
    "# parsed = requests_html.soup_parse(html_load)\n",
    "# Option 2: import the function directly\n",
    "from requests_html import soup_parse\n",
    "# parsed = soup_parse(html_load)\n",
    "```\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Element html at 0x227365a8d18>"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "解析后   # the root <html> element"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element body at 0x227365ad3b8>]"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "解析后.xpath('body')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element head at 0x227365ad7c8>]"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "解析后.xpath('head')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element a at 0x2273660a188>,\n",
       " <Element a at 0x227365adae8>,\n",
       " <Element a at 0x227365adb38>,\n",
       " <Element a at 0x227365adb88>,\n",
       " <Element a at 0x227365adbd8>,\n",
       " <Element a at 0x227365adc78>,\n",
       " <Element a at 0x227365adcc8>,\n",
       " <Element a at 0x227365add18>,\n",
       " <Element a at 0x227365add68>,\n",
       " <Element a at 0x227365addb8>,\n",
       " <Element a at 0x227365ade08>,\n",
       " <Element a at 0x227365ade58>,\n",
       " <Element a at 0x227365adea8>,\n",
       " <Element a at 0x227365adef8>,\n",
       " <Element a at 0x227365adf48>,\n",
       " <Element a at 0x227365adf98>,\n",
       " <Element a at 0x227366b3048>,\n",
       " <Element a at 0x227366b3098>,\n",
       " <Element a at 0x227366b30e8>,\n",
       " <Element a at 0x227366b3138>,\n",
       " <Element a at 0x227366b3188>,\n",
       " <Element a at 0x227366b31d8>,\n",
       " <Element a at 0x227366b3228>,\n",
       " <Element a at 0x227366b3278>,\n",
       " <Element a at 0x227366b32c8>,\n",
       " <Element a at 0x227366b3318>,\n",
       " <Element a at 0x227366b3368>,\n",
       " <Element a at 0x227366b33b8>,\n",
       " <Element a at 0x227366b3408>,\n",
       " <Element a at 0x227366b3458>,\n",
       " <Element a at 0x227366b34a8>,\n",
       " <Element a at 0x227366b34f8>,\n",
       " <Element a at 0x227366b3548>,\n",
       " <Element a at 0x227366b3598>,\n",
       " <Element a at 0x227366b35e8>,\n",
       " <Element a at 0x227366b3638>,\n",
       " <Element a at 0x227366b3688>,\n",
       " <Element a at 0x227366b36d8>,\n",
       " <Element a at 0x227366b3728>,\n",
       " <Element a at 0x227366b3778>,\n",
       " <Element a at 0x227366b37c8>,\n",
       " <Element a at 0x227366b3818>,\n",
       " <Element a at 0x227366b3868>,\n",
       " <Element a at 0x227366b38b8>,\n",
       " <Element a at 0x227366b3908>,\n",
       " <Element a at 0x227366b3958>,\n",
       " <Element a at 0x227366b39a8>,\n",
       " <Element a at 0x227366b39f8>,\n",
       " <Element a at 0x227366b3a48>,\n",
       " <Element a at 0x227366b3a98>,\n",
       " <Element a at 0x227366b3ae8>,\n",
       " <Element a at 0x227366b3b38>,\n",
       " <Element a at 0x227366b3b88>,\n",
       " <Element a at 0x227366b3bd8>,\n",
       " <Element a at 0x227366b3c28>,\n",
       " <Element a at 0x227366b3c78>,\n",
       " <Element a at 0x227366b3cc8>,\n",
       " <Element a at 0x227366b3d18>,\n",
       " <Element a at 0x227366b3d68>,\n",
       " <Element a at 0x227366b3db8>,\n",
       " <Element a at 0x227366b3e08>,\n",
       " <Element a at 0x227366b3e58>,\n",
       " <Element a at 0x227366b3ea8>,\n",
       " <Element a at 0x227366b3ef8>,\n",
       " <Element a at 0x227366b3f48>,\n",
       " <Element a at 0x227366b3f98>,\n",
       " <Element a at 0x227366b4048>,\n",
       " <Element a at 0x227366b4098>,\n",
       " <Element a at 0x227366b40e8>,\n",
       " <Element a at 0x227366b4138>,\n",
       " <Element a at 0x227366b4188>,\n",
       " <Element a at 0x227366b41d8>,\n",
       " <Element a at 0x227366b4228>,\n",
       " <Element a at 0x227366b4278>,\n",
       " <Element a at 0x227366b42c8>,\n",
       " <Element a at 0x227366b4318>,\n",
       " <Element a at 0x227366b4368>]"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "解析后.xpath('//a')  # greedy: selects every <a> element in the document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会',\n",
       " '文学与传媒学院2019年学术研讨会暨总结大会顺利召开',\n",
       " '展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕',\n",
       " '文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束',\n",
       " '文学与传媒学院教师招聘启事',\n",
       " '创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行',\n",
       " '垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行',\n",
       " '以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束',\n",
       " '文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖',\n",
       " '文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩',\n",
       " '倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕',\n",
       " '毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展']"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "解析后.xpath('//*[@class=\"news_title\"]//a/@title')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using it directly\n",
    "\n",
    "```python\n",
    "r.html.xpath()\n",
    "```\n",
    "\n",
    "Now you try."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会', '文学与传媒学院2019年学术研讨会暨总结大会顺利召开', '展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕', '文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束', '文学与传媒学院教师招聘启事', '创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行', '垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行', '以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束', '文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖', '文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩', '倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕', '毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展']\n",
      "['/index.php/home/article/search_detail/id/6363.html', '/index.php/home/article/search_detail/id/6366.html', '/index.php/home/article/search_detail/id/6318.html', '/index.php/home/article/search_detail/id/6154.html', '/index.php/home/article/search_detail/id/5348.html', '/index.php/home/article/search_detail/id/6016.html', '/index.php/home/article/search_detail/id/6019.html', '/index.php/home/article/search_detail/id/5794.html', '/index.php/home/article/search_detail/id/5776.html', '/index.php/home/article/search_detail/id/5777.html', '/index.php/home/article/search_detail/id/5642.html', '/index.php/home/article/search_detail/id/5647.html']\n",
      "['2020-01-06', '2020-01-06', '2019-12-20', '2019-11-22', '2019-11-05', '2019-11-04', '2019-11-04', '2019-09-16', '2019-09-09', '2019-09-09', '2019-06-24', '2019-06-24']\n"
     ]
    }
   ],
   "source": [
    "# A6  Use requests-html directly\n",
    "payload = {\n",
    "    \"keyword\": \"文学与传媒学院\",\n",
    "    \"p\": \"1\"\n",
    "}\n",
    "\n",
    "session = HTMLSession()\n",
    "r = session.get(\"http://www.nfu.edu.cn/index.php/home/article/search.html\", params=payload)\n",
    "\n",
    "# Save a local copy for later (a good habit: keep the raw HTML somewhere safe)\n",
    "with open(\"20春_Web数据挖掘_week02_nfu_文学与传媒学院.html\", encoding=\"utf8\", mode=\"w\") as fp:\n",
    "    fp.write(r.html.html)\n",
    "\n",
    "# Query the parsed page directly with r.html.xpath()\n",
    "print(r.html.xpath('//*[@class=\"news_title\"]/a/@title'))\n",
    "print(r.html.xpath('//*[@class=\"news_title\"]/a/@href'))\n",
    "print(r.html.xpath('//font[@class=\"right-more\"]/text()'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
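    "# xpath\n",
    "\n",
    "Students will practice:\n",
    "\n",
    "* `r.html`: carving up the page, like the master butcher in the idiom 庖丁解牛\n",
    "* `r.html.xpath()`: picking out the meat\n",
    "  * **Elements/tags are the bones and sinew; the meat is usually in the values and text**\n",
    "  * HTML is built from elements/tags\n",
    "  * Values and text never stand alone; they must be attached to an element/tag\n",
    "* Using xpath with a greedy (unselective) vs. an ungreedy (selective) strategy\n",
    "\n",
    "* Extracting tags, attributes, and values\n",
    "  * Learn to tell the parts apart by their colors in Chrome Inspector\n",
    "      * Elements/tags: color ?\n",
    "      * Attributes: color ?\n",
    "      * Values: color ?\n",
    "      * Text: color ?\n",
    "      * HTML comments: color ?"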
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting values and text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会',\n",
       " '文学与传媒学院2019年学术研讨会暨总结大会顺利召开',\n",
       " '展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕',\n",
       " '文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束',\n",
       " '文学与传媒学院教师招聘启事',\n",
       " '创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行',\n",
       " '垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行',\n",
       " '以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束',\n",
       " '文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖',\n",
       " '文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩',\n",
       " '倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕',\n",
       " '毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展']"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-A-1 Get an attribute value: note the last step of the xpath\n",
    "解析后.xpath('//div[@class=\"news_title\"]/a/@title')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会',\n",
       " '2019年学术研讨会暨总结大会顺利召开',\n",
       " '展现当代青年的迷惘与奋进——我校',\n",
       " '大型原创舞台剧《春至》圆满落幕',\n",
       " '考研座谈暨2020年考研交流答疑会圆满结束',\n",
       " '教师招聘启事',\n",
       " '创意无限，未来可期——',\n",
       " '青马工程第四讲暨闭营仪式顺利举行',\n",
       " '垃圾分类我先行——',\n",
       " '“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行',\n",
       " '以梦为马，不负韶华——',\n",
       " '2019级新生开学典礼圆满结束',\n",
       " '学子在全国高校数字艺术设计大赛中斩获大奖',\n",
       " '学子在第七届中国大学生公共关系策划大赛中喜获佳绩',\n",
       " '倾心之作，致敬经典——',\n",
       " '紫阳戏剧社《倾城之恋》话剧展演圆满落幕',\n",
       " '毕业季 | 今朝有离别，青春不散场 ——',\n",
       " '2019届毕业生毕业季系列活动有序开展']"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-A-2 Get text\n",
    "解析后.xpath('//div[@class=\"news_title\"]/a/text()')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# B-A-3 Your turn\n",
    "# Can you explain why B-A-1 and B-A-2 give different results?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
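    "- `//div[@class=\"news_title\"]/a/@title`: returns the value of the `title` attribute of each `<a>` element, so there is exactly one string per link (12 here).\n",
    "- `//div[@class=\"news_title\"]/a/text()`: returns only the text nodes that are direct children of each `<a>`. The output has 18 fragments and none of them contains the keyword 文学与传媒学院, which suggests the page wraps the keyword in a child element (probably for highlighting). Text inside that child is skipped, and the text on either side of it comes back as separate pieces."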
   ]
  },
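  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between `@title` and `text()` can be reproduced on a small made-up snippet with `lxml`. This is only a sketch: the `<em>` highlight tag and the sample title are assumptions for illustration, not the real markup of the NFU page. `string()` is shown as one possible way to get the whole link text back in one piece.\n",
    "\n",
    "```python\n",
    "from lxml import html\n",
    "\n",
    "# Hypothetical markup: the keyword is wrapped in a child <em> element.\n",
    "snippet = '''\n",
    "<div class=\"news_title\">\n",
    "  <a href=\"/1.html\" title=\"News: School of Arts wins award\">News: <em>School of Arts</em> wins award</a>\n",
    "</div>\n",
    "'''\n",
    "tree = html.fromstring(snippet)\n",
    "\n",
    "# One attribute value per <a>\n",
    "titles = tree.xpath('//div[@class=\"news_title\"]/a/@title')\n",
    "\n",
    "# Only text nodes directly under <a>; the text inside <em> is skipped\n",
    "fragments = tree.xpath('//div[@class=\"news_title\"]/a/text()')\n",
    "\n",
    "# string() concatenates every text node inside the first matching <a>\n",
    "full_text = tree.xpath('string(//div[@class=\"news_title\"]/a)')\n",
    "\n",
    "print(titles)\n",
    "print(fragments)\n",
    "print(full_text)\n",
    "```\n",
    "\n",
    "Here `fragments` has two pieces for a single link, which mirrors the split seen in B-A-2."
   ]
  },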
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
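    "## XPath from a data scientist's point of view\n",
    "![](https://qxf2.com/blog/wp-content/uploads/2015/12/Table.png)\n",
    "* Greedy (unselective) strategy\n",
    "    * Aims for completeness: miss nothing\n",
    "    * May pull in junk\n",
    "    * Uses `//` (descendant) rather than `/` (child)\n",
    "    * Uses **any** element/tag (`*`) rather than a **specific** element/tag\n",
    "* Ungreedy (selective) strategy\n",
    "    * Aims for precision and tidy data\n",
    "    * May miss data\n",
    "    * Uses `/` (child) rather than `//` (descendant)\n",
    "    * Uses a **specific** element/tag rather than **any** element/tag (`*`)"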
   ]
  },
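  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The trade-off between the two strategies can be sketched on a tiny page. The markup below (a nav link, two news links, and a footer link) is invented for illustration; only the `news_title` class name is borrowed from the page used in this notebook.\n",
    "\n",
    "```python\n",
    "from lxml import html\n",
    "\n",
    "# Hypothetical page: one nav link, two news links, one footer link.\n",
    "page = '''\n",
    "<html><body>\n",
    "  <div class=\"nav\"><a href=\"/\" title=\"Home\">Home</a></div>\n",
    "  <ul class=\"list-ul\">\n",
    "    <li><div class=\"news_title\"><a href=\"/1.html\" title=\"First story\">First story</a></div></li>\n",
    "    <li><div class=\"news_title\"><a href=\"/2.html\" title=\"Second story\">Second story</a></div></li>\n",
    "  </ul>\n",
    "  <div class=\"footer\"><a href=\"/about.html\">About</a></div>\n",
    "</body></html>\n",
    "'''\n",
    "tree = html.fromstring(page)\n",
    "\n",
    "# Greedy: every <a>, wherever it sits in the tree\n",
    "greedy = tree.xpath('//a')\n",
    "\n",
    "# Less greedy: still every <a>, but only those that carry a title attribute\n",
    "less_greedy = tree.xpath('//a/@title')\n",
    "\n",
    "# Ungreedy: only <a> elements that are children of div.news_title\n",
    "ungreedy = tree.xpath('//div[@class=\"news_title\"]/a/@title')\n",
    "\n",
    "print(len(greedy), less_greedy, ungreedy)\n",
    "```\n",
    "\n",
    "The greedy query misses nothing but also returns the nav and footer links; the ungreedy query returns clean data but would silently return nothing if the site renamed the class."
   ]
  },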
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Greedy (unselective)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element 'a' href='/index.php' target='_blank' title='中山大学南方学院'>,\n",
       " <Element 'a' href='http://oa.nfu.edu.cn/' target='_blank'>,\n",
       " <Element 'a' href='http://en.nfu.edu.cn/'>,\n",
       " <Element 'a' href='javascript:;' id='search_btn'>,\n",
       " <Element 'a' href='/index.php'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/29.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/29.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/30.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/135.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/34.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/104.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/2.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/61.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/61.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/36.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/165.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/31.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/31.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/163.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/164.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/106.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/106.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/127.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/107.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/49.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/49.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/129.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/50.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/79.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/79.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/80.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/159.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/159.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/161.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/44.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/44.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/45.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/32.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/32.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/105.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/87.html' target='_self'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/51.html' target='_blank'>,\n",
       " <Element 'a' href='/index.php/home/article/index/cid/82.html' target='_self'>,\n",
       " <Element 'a' href='/index.php'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6363.html' target='_blank' title='文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6366.html' target='_blank' title='文学与传媒学院2019年学术研讨会暨总结大会顺利召开'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6318.html' target='_blank' title='展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6154.html' target='_blank' title='文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5348.html' target='_blank' title='文学与传媒学院教师招聘启事'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6016.html' target='_blank' title='创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6019.html' target='_blank' title='垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5794.html' target='_blank' title='以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5776.html' target='_blank' title='文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5777.html' target='_blank' title='文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5642.html' target='_blank' title='倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5647.html' target='_blank' title='毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展'>,\n",
       " <Element 'a' class=('num',) href='/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/2.html'>,\n",
       " <Element 'a' class=('num',) href='/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/3.html'>,\n",
       " <Element 'a' class=('num',) href='/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/4.html'>,\n",
       " <Element 'a' class=('num',) href='/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/5.html'>,\n",
       " <Element 'a' class=('num',) href='/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/6.html'>,\n",
       " <Element 'a' class=('num',) href='/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/7.html'>,\n",
       " <Element 'a' class=('next',) href='/index.php/home/article/search/keyword/%E6%96%87%E5%AD%A6%E4%B8%8E%E4%BC%A0%E5%AA%92%E5%AD%A6%E9%99%A2/p/2.html'>,\n",
       " <Element 'a' href='http://www.moe.gov.cn/' target='_blank' title='教育部'>,\n",
       " <Element 'a' href='http://www.gz.gov.cn/' target='_blank' title='广州市政府'>,\n",
       " <Element 'a' href='http://www.cnki.net/' target='_blank' title='中国知网'>,\n",
       " <Element 'a' href='http://edu.gd.gov.cn' target='_blank' title='广东省教育厅'>,\n",
       " <Element 'a' href='http://www.gdpr.com/' target='_blank' title='珠江投资'>,\n",
       " <Element 'a' href='http://journal.nfu.edu.cn/CN/volumn/home.shtml' target='_blank' title='南方论丛'>,\n",
       " <Element 'a' href='http://www.sysu.edu.cn/' target='_blank' title='中山大学 '>,\n",
       " <Element 'a' href='http://www.nfu.edu.cn/index.php/home/article/index/cid/136.html' target='_blank' title='珠江教育联盟'>,\n",
       " <Element 'a' href='/index.php/home/article/link.html'>,\n",
       " <Element 'a' href='http://www.unsun.net' target='_blank'>,\n",
       " <Element 'a' href='http://www.beian.miit.gov.cn' target='_blank'>,\n",
       " <Element 'a' href='/index.php/admin/index/login.html' target='_blank'>,\n",
       " <Element 'a' href='http://old.nfu.edu.cn/' target='_blank'>,\n",
       " <Element 'a' href='http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=44011702000081' style='display:inline-block;text-decoration:none;height:20px;line-height:20px;' target='_blank'>]"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-B-1 xpath\n",
    "r.html.xpath('//a')  # greedy: every <a> element, no filtering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['中山大学南方学院',\n",
       " '文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会',\n",
       " '文学与传媒学院2019年学术研讨会暨总结大会顺利召开',\n",
       " '展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕',\n",
       " '文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束',\n",
       " '文学与传媒学院教师招聘启事',\n",
       " '创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行',\n",
       " '垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行',\n",
       " '以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束',\n",
       " '文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖',\n",
       " '文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩',\n",
       " '倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕',\n",
       " '毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展',\n",
       " '教育部',\n",
       " '广州市政府',\n",
       " '中国知网',\n",
       " '广东省教育厅',\n",
       " '珠江投资',\n",
       " '南方论丛',\n",
       " '中山大学 ',\n",
       " '珠江教育联盟']"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-B-2 xpath restricted to one attribute\n",
    "# Compare with B-B-1: are there fewer results?\n",
    "# An <a> without a title attribute is not selected\n",
    "解析后.xpath('//a/@title')  # less greedy: only <a> elements that have a title"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['中山大学南方学院',\n",
       " '文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会',\n",
       " '文学与传媒学院2019年学术研讨会暨总结大会顺利召开',\n",
       " '展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕',\n",
       " '文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束',\n",
       " '文学与传媒学院教师招聘启事',\n",
       " '创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行',\n",
       " '垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行',\n",
       " '以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束',\n",
       " '文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖',\n",
       " '文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩',\n",
       " '倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕',\n",
       " '毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展',\n",
       " '教育部',\n",
       " '广州市政府',\n",
       " '中国知网',\n",
       " '广东省教育厅',\n",
       " '珠江投资',\n",
       " '南方论丛',\n",
       " '中山大学 ',\n",
       " '珠江教育联盟']"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-B-3 xpath\n",
    "# greedy about the element/tag: any element, as long as it has a title attribute\n",
    "r.html.xpath('//*/@title')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ungreedy (selective)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会',\n",
       " '文学与传媒学院2019年学术研讨会暨总结大会顺利召开',\n",
       " '展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕',\n",
       " '文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束',\n",
       " '文学与传媒学院教师招聘启事',\n",
       " '创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行',\n",
       " '垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行',\n",
       " '以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束',\n",
       " '文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖',\n",
       " '文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩',\n",
       " '倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕',\n",
       " '毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展']"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-B-4 xpath: ungreedy, more precise\n",
    "r.html.xpath('//div[@class=\"news_title\"]/a/@title')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element 'a' href='/index.php/home/article/search_detail/id/6363.html' target='_blank' title='文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6366.html' target='_blank' title='文学与传媒学院2019年学术研讨会暨总结大会顺利召开'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6318.html' target='_blank' title='展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6154.html' target='_blank' title='文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5348.html' target='_blank' title='文学与传媒学院教师招聘启事'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6016.html' target='_blank' title='创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6019.html' target='_blank' title='垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5794.html' target='_blank' title='以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5776.html' target='_blank' title='文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5777.html' target='_blank' title='文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5642.html' target='_blank' title='倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5647.html' target='_blank' title='毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展'>]"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-B-5 xpath: ungreedy, most precise\n",
    "#   this xpath uses neither // nor *\n",
    "r.html.xpath('body/div[@class=\"lin-content\"]/div[@class=\"lin-neiye clearfix\"]/div[@class=\"search_list_right\"]/div[@class=\"ny_content\"]/ul[@class=\"list-ul\"]/li/div[@class=\"news_title\"]/a')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> What about this one?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<Element 'a' href='/index.php/home/article/search_detail/id/6363.html' target='_blank' title='文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6366.html' target='_blank' title='文学与传媒学院2019年学术研讨会暨总结大会顺利召开'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6318.html' target='_blank' title='展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6154.html' target='_blank' title='文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5348.html' target='_blank' title='文学与传媒学院教师招聘启事'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6016.html' target='_blank' title='创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/6019.html' target='_blank' title='垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5794.html' target='_blank' title='以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5776.html' target='_blank' title='文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5777.html' target='_blank' title='文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5642.html' target='_blank' title='倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕'>,\n",
       " <Element 'a' href='/index.php/home/article/search_detail/id/5647.html' target='_blank' title='毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展'>]"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r.html.xpath('//div[@class=\"news_title\"]/a')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Xpath Axis](http://krum.rz.uni-mannheim.de/inet-2005/images/xpath-axis.gif)\n",
    "## More xpath\n",
    "XPath is a language for finding information in XML documents (and in HTML, which is likewise plain text organized as a tree).\n",
    "\n",
    "2. Get familiar with [xpath syntax](https://www.w3cschool.cn/xpath/xpath-syntax.html) and [xpath nodes](https://www.w3cschool.cn/xpath/xpath-nodes.html)\n",
    "    * Nodes\n",
    "        * Element, attribute, text, namespace, and document (root) nodes\n",
    "    * Node relationships\n",
    "        * Parent vs. ancestor\n",
    "        * Children vs. descendant\n",
    "        * Sibling\n",
    "3. Use the [xpath cheatsheet](https://devhints.io/xpath)\n",
    "  * In Chrome Inspector\n",
    "  * In requests-html (Python)"
   ]
  },
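  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The node relationships listed above map onto XPath axes. The following sketch walks them on one invented list item (the markup is an assumption that loosely imitates the search result page) and collects tag names so that the difference between parent and ancestor, and between child and descendant, is visible.\n",
    "\n",
    "```python\n",
    "from lxml import html\n",
    "\n",
    "# Hypothetical markup: one list item holding a date and a titled link.\n",
    "page = '''\n",
    "<html><body>\n",
    "  <ul class=\"list-ul\">\n",
    "    <li><font class=\"right-more\">2020-01-06</font><div class=\"news_title\"><a href=\"/1.html\">First</a></div></li>\n",
    "  </ul>\n",
    "</body></html>\n",
    "'''\n",
    "tree = html.fromstring(page)\n",
    "link = tree.xpath('//a')[0]\n",
    "date = tree.xpath('//font')[0]\n",
    "\n",
    "# parent: exactly one level up; ancestor: every level up to the root\n",
    "parent = [e.tag for e in link.xpath('parent::*')]\n",
    "ancestors = [e.tag for e in link.xpath('ancestor::*')]\n",
    "\n",
    "# child (/): one level down; descendant (//): every level down\n",
    "children_of_ul = [e.tag for e in tree.xpath('//ul/*')]\n",
    "descendants_of_ul = [e.tag for e in tree.xpath('//ul//*')]\n",
    "\n",
    "# sibling: same parent, here the element that follows the date\n",
    "next_siblings = [e.tag for e in date.xpath('following-sibling::*')]\n",
    "\n",
    "print(parent, ancestors)\n",
    "print(children_of_ul, descendants_of_ul)\n",
    "print(next_siblings)\n",
    "```\n",
    "\n",
    "The same expressions can be pasted into the Chrome Inspector search box (Ctrl+F in the Elements panel) to check them against a live page."
   ]
  },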
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会', '文学与传媒学院2019年学术研讨会暨总结大会顺利召开', '展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕', '文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束', '文学与传媒学院教师招聘启事', '创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行', '垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行', '以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束', '文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖', '文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩', '倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕', '毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展']\n"
     ]
    }
   ],
   "source": [
    "# B-C-1 \n",
    "print (r.html.xpath('//div[@class=\"news_title\"]/a/@title'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['/index.php/home/article/search_detail/id/6363.html', '/index.php/home/article/search_detail/id/6366.html', '/index.php/home/article/search_detail/id/6318.html', '/index.php/home/article/search_detail/id/6154.html', '/index.php/home/article/search_detail/id/5348.html', '/index.php/home/article/search_detail/id/6016.html', '/index.php/home/article/search_detail/id/6019.html', '/index.php/home/article/search_detail/id/5794.html', '/index.php/home/article/search_detail/id/5776.html', '/index.php/home/article/search_detail/id/5777.html', '/index.php/home/article/search_detail/id/5642.html', '/index.php/home/article/search_detail/id/5647.html']\n"
     ]
    }
   ],
   "source": [
    "# B-C-2\n",
    "print (r.html.xpath('//div[@class=\"news_title\"]/a/@href'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['2020-01-06', '2020-01-06', '2019-12-20', '2019-11-22', '2019-11-05', '2019-11-04', '2019-11-04', '2019-09-16', '2019-09-09', '2019-09-09', '2019-06-24', '2019-06-24']\n"
     ]
    }
   ],
   "source": [
    "# B-C-3\n",
    "print (r.html.xpath('//font[@class=\"right-more\"]/text()'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['2020-01-06', '2020-01-06', '2019-12-20', '2019-11-22', '2019-11-05', '2019-11-04', '2019-11-04', '2019-09-16', '2019-09-09', '2019-09-09', '2019-06-24', '2019-06-24']\n"
     ]
    }
   ],
   "source": [
    "# B-C-4\n",
    "print (r.html.xpath('//div[@class=\"news_title\"]/preceding-sibling::font/text()'))\n",
    "\n",
    "# Instructor Liao argues that the B-C-4 code is better than B-C-3. Looking at the xpath syntax in B-C-1 and B-C-2, can you guess why?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> `preceding-sibling` is defined as “selects all siblings that come before the current node”."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
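    "> So `preceding-sibling::font` returns the text of the `<font>` elements that share a parent with each `<div class=\"news_title\">` and come before it. Is the advantage that the dates are located from the same node as the titles and links, so the three lists stay aligned?"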
   ]
  },
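  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to test that guess is a page where the class `right-more` is also used outside the news list. The sidebar in the sketch below is an invented assumption, not something observed on the real page; it only shows what could go wrong when the date is selected by class alone.\n",
    "\n",
    "```python\n",
    "from lxml import html\n",
    "\n",
    "# Hypothetical page: a sidebar reuses the right-more class for a non-date label.\n",
    "page = '''\n",
    "<html><body>\n",
    "  <div class=\"sidebar\"><font class=\"right-more\">more</font></div>\n",
    "  <ul>\n",
    "    <li><font class=\"right-more\">2020-01-06</font><div class=\"news_title\"><a href=\"/1.html\" title=\"First\">First</a></div></li>\n",
    "    <li><font class=\"right-more\">2019-12-20</font><div class=\"news_title\"><a href=\"/2.html\" title=\"Second\">Second</a></div></li>\n",
    "  </ul>\n",
    "</body></html>\n",
    "'''\n",
    "tree = html.fromstring(page)\n",
    "\n",
    "titles = tree.xpath('//div[@class=\"news_title\"]/a/@title')\n",
    "\n",
    "# B-C-3 style: selected by class only, so the sidebar label sneaks in\n",
    "by_class = tree.xpath('//font[@class=\"right-more\"]/text()')\n",
    "\n",
    "# B-C-4 style: anchored on the same div as the titles\n",
    "by_anchor = tree.xpath('//div[@class=\"news_title\"]/preceding-sibling::font/text()')\n",
    "\n",
    "print(len(titles), by_class, by_anchor)\n",
    "```\n",
    "\n",
    "In this sketch `by_class` has one more item than `titles`, so the columns would no longer line up row by row, while `by_anchor` keeps the same length as `titles`."
   ]
  },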
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exporting xlsx with pandas\n",
    "4. Basic use of [pd.DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html) and its `to_excel` method"
   ]
  },
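  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before exporting, it can help to check that the three scraped lists have the same length, since a DataFrame pairs them up row by row. The sketch below is one possible way to do that; `build_table` and `BASE_URL` are hypothetical names introduced here, and turning the relative links into absolute ones with `urljoin` is an optional extra, not something the notebook requires. The sample row is taken from the output above.\n",
    "\n",
    "```python\n",
    "from urllib.parse import urljoin\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "BASE_URL = 'http://www.nfu.edu.cn'  # assumed site root for expanding relative links\n",
    "\n",
    "\n",
    "def build_table(titles, links, dates):\n",
    "    # Refuse to build a table from columns of different lengths\n",
    "    if not (len(titles) == len(links) == len(dates)):\n",
    "        raise ValueError('columns have different lengths')\n",
    "    return pd.DataFrame({\n",
    "        '标题': titles,\n",
    "        '链结': [urljoin(BASE_URL, link) for link in links],\n",
    "        '日期': dates,\n",
    "    })\n",
    "\n",
    "\n",
    "df = build_table(\n",
    "    ['文学与传媒学院教师招聘启事'],\n",
    "    ['/index.php/home/article/search_detail/id/5348.html'],\n",
    "    ['2019-11-05'],\n",
    ")\n",
    "print(df)\n",
    "\n",
    "# Uncomment to write the file; to_excel needs the openpyxl package installed\n",
    "# df.to_excel('week02_news.xlsx', index=False)\n",
    "```\n",
    "\n",
    "Passing `index=False` leaves the 0, 1, 2, ... row labels out of the spreadsheet."
   ]
  },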
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>标题</th>\n",
       "      <th>链结</th>\n",
       "      <th>日期</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会</td>\n",
       "      <td>/index.php/home/article/search_detail/id/6363....</td>\n",
       "      <td>2020-01-06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>文学与传媒学院2019年学术研讨会暨总结大会顺利召开</td>\n",
       "      <td>/index.php/home/article/search_detail/id/6366....</td>\n",
       "      <td>2020-01-06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕</td>\n",
       "      <td>/index.php/home/article/search_detail/id/6318....</td>\n",
       "      <td>2019-12-20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束</td>\n",
       "      <td>/index.php/home/article/search_detail/id/6154....</td>\n",
       "      <td>2019-11-22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>文学与传媒学院教师招聘启事</td>\n",
       "      <td>/index.php/home/article/search_detail/id/5348....</td>\n",
       "      <td>2019-11-05</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行</td>\n",
       "      <td>/index.php/home/article/search_detail/id/6016....</td>\n",
       "      <td>2019-11-04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行</td>\n",
       "      <td>/index.php/home/article/search_detail/id/6019....</td>\n",
       "      <td>2019-11-04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束</td>\n",
       "      <td>/index.php/home/article/search_detail/id/5794....</td>\n",
       "      <td>2019-09-16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖</td>\n",
       "      <td>/index.php/home/article/search_detail/id/5776....</td>\n",
       "      <td>2019-09-09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩</td>\n",
       "      <td>/index.php/home/article/search_detail/id/5777....</td>\n",
       "      <td>2019-09-09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕</td>\n",
       "      <td>/index.php/home/article/search_detail/id/5642....</td>\n",
       "      <td>2019-06-24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展</td>\n",
       "      <td>/index.php/home/article/search_detail/id/5647....</td>\n",
       "      <td>2019-06-24</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                标题  \\\n",
       "0             文学与传媒学院教师获邀参加2020年U40中澳暑期工作营及国际学术研讨会   \n",
       "1                       文学与传媒学院2019年学术研讨会暨总结大会顺利召开   \n",
       "2           展现当代青年的迷惘与奋进——我校文学与传媒学院大型原创舞台剧《春至》圆满落幕   \n",
       "3                     文学与传媒学院考研座谈暨2020年考研交流答疑会圆满结束   \n",
       "4                                    文学与传媒学院教师招聘启事   \n",
       "5               创意无限，未来可期——文学与传媒学院青马工程第四讲暨闭营仪式顺利举行   \n",
       "6      垃圾分类我先行——文学与传媒学院“分门别类，谁与争锋”垃圾分类趣味知识竞赛决赛顺利举行   \n",
       "7                以梦为马，不负韶华——文学与传媒学院2019级新生开学典礼圆满结束   \n",
       "8                      文学与传媒学院学子在全国高校数字艺术设计大赛中斩获大奖   \n",
       "9                  文学与传媒学院学子在第七届中国大学生公共关系策划大赛中喜获佳绩   \n",
       "10           倾心之作，致敬经典——文学与传媒学院紫阳戏剧社《倾城之恋》话剧展演圆满落幕   \n",
       "11  毕业季 | 今朝有离别，青春不散场 ——文学与传媒学院2019届毕业生毕业季系列活动有序开展   \n",
       "\n",
       "                                                   链结          日期  \n",
       "0   /index.php/home/article/search_detail/id/6363....  2020-01-06  \n",
       "1   /index.php/home/article/search_detail/id/6366....  2020-01-06  \n",
       "2   /index.php/home/article/search_detail/id/6318....  2019-12-20  \n",
       "3   /index.php/home/article/search_detail/id/6154....  2019-11-22  \n",
       "4   /index.php/home/article/search_detail/id/5348....  2019-11-05  \n",
       "5   /index.php/home/article/search_detail/id/6016....  2019-11-04  \n",
       "6   /index.php/home/article/search_detail/id/6019....  2019-11-04  \n",
       "7   /index.php/home/article/search_detail/id/5794....  2019-09-16  \n",
       "8   /index.php/home/article/search_detail/id/5776....  2019-09-09  \n",
       "9   /index.php/home/article/search_detail/id/5777....  2019-09-09  \n",
       "10  /index.php/home/article/search_detail/id/5642....  2019-06-24  \n",
       "11  /index.php/home/article/search_detail/id/5647....  2019-06-24  "
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# B-D-1 Build the pd.DataFrame from the XPath results (DataFrame construction is covered in the pandas course)\n",
    "df = pd.DataFrame( {\n",
    "         \"标题\": r.html.xpath('//div[@class=\"news_title\"]/a/@title'),\n",
    "         \"链结\": r.html.xpath('//div[@class=\"news_title\"]/a/@href'),\n",
    "         \"日期\": r.html.xpath('//font[@class=\"right-more\"]/text()'),\n",
    "     } )\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {},
   "outputs": [],
   "source": [
    "# B-D-2 Export the pd.DataFrame to Excel (covered in the pandas course)\n",
    "df.to_excel(\"20春_Web数据挖掘_week02_nfu_文学与传媒学院.xlsx\", sheet_name=\"搜查结果\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Weekly summary\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Open the Excel file to check the results"
   ]
  },
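  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Homework and next week's project: m.liepin.com\n",
    "\n",
    "Apply XPath to [m.liepin.com](https://m.liepin.com/zhaopin/).\n",
    "\n",
    "As a data scientist, what valuable data does m.liepin.com offer, and how do you plan to extract it? For example:\n",
    "* job title\n",
    "* job location\n",
    "* salary"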
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {},
   "outputs": [],
   "source": [
    "# C-1 Fetch a single results page\n",
    "url = \"https://m.liepin.com/zhaopin/?keyword=pandas\"\n",
    "session = HTMLSession()\n",
    "r = session.get( url )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {},
   "outputs": [],
   "source": [
    "# C-2 Save a local copy of the page for later use\n",
    "with open(\"20春_Web数据挖掘_week02_zhaopin_pandas.html\", mode=\"w\", encoding=\"utf8\") as fp:\n",
    "    fp.write(r.html.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 157,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['数据分析助理 ', ' Python Architect/Developer \\xa0\\xa0Python架构师/开发 ', '数据分析工程师 ', 'AI工程师（实习岗） ', 'python ', '量化策略分析师 ', 'python数据开发工程师 ', 'python数据工程师 ', '算法科学家 ', 'nlp语音人工智能工程师（外包） ', '中高级python开发 ', 'SPBU-数据分析师（风控） ', 'python数据工程师 ', 'python开发工程师 ', '数据挖掘工程师 (MJ002691) ', '算法工程师（电商安全业务） ', '机器学习专家 ', 'python培训师 ', 'Python developer-ETL ', '视觉算法工程师 ', '量化基金投资经理（总监/副总监/高级经理/经理/助理） ', 'Python数据分析及机器学习讲师 ', '中级Python开发工程师 ', '业务数据分析师 ', 'python开发工程师 ']\n",
      "['https://m.liepin.com/job/1926863845.shtml', 'https://m.liepin.com/job/1926257725.shtml', 'https://m.liepin.com/job/1923135935.shtml', 'https://m.liepin.com/job/1916670241.shtml', 'https://m.liepin.com/job/1926919427.shtml', 'https://m.liepin.com/job/1925083443.shtml', 'https://m.liepin.com/job/1925012965.shtml', 'https://m.liepin.com/a/19449311.shtml', 'https://m.liepin.com/a/19215715.shtml', 'https://m.liepin.com/a/19004681.shtml', 'https://m.liepin.com/a/18847711.shtml', 'https://m.liepin.com/job/1922750183.shtml', 'https://m.liepin.com/job/1926169045.shtml', 'https://m.liepin.com/job/1925786431.shtml', 'https://m.liepin.com/job/1926206729.shtml', 'https://m.liepin.com/job/1926146353.shtml', 'https://m.liepin.com/job/1925780721.shtml', 'https://m.liepin.com/job/1925142097.shtml', 'https://m.liepin.com/job/1924665117.shtml', 'https://m.liepin.com/job/1923016243.shtml', 'https://m.liepin.com/job/1923774079.shtml', 'https://m.liepin.com/job/1916848263.shtml', 'https://m.liepin.com/job/1916900193.shtml', 'https://m.liepin.com/job/1918822599.shtml', 'https://m.liepin.com/job/1916062237.shtml']\n",
      "['5-7k·12薪', '26-40k·12薪', '8-15k·13薪', '5-7k·12薪', '10-15k·12薪', '25-35k·12薪', '10-15k·12薪', '25-35k·12薪', '65-70k·16薪', '30-54k·13薪', '20-40k·13薪', '15-30k·16薪', '10-15k·12薪', '15-22k·13薪', '20-40k·13薪', '20-40k·12薪', '面议', '10-15k·12薪', '12-16k·14薪', '20-30k·12薪', '12-35k·13薪', '20-30k·12薪', '面议', '7-9k·12薪', '面议']\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>职称</th>\n",
       "      <th>链接</th>\n",
       "      <th>薪水</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>数据分析助理</td>\n",
       "      <td>https://m.liepin.com/job/1926863845.shtml</td>\n",
       "      <td>5-7k·12薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Python Architect/Developer   Python架构师/开发</td>\n",
       "      <td>https://m.liepin.com/job/1926257725.shtml</td>\n",
       "      <td>26-40k·12薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>数据分析工程师</td>\n",
       "      <td>https://m.liepin.com/job/1923135935.shtml</td>\n",
       "      <td>8-15k·13薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AI工程师（实习岗）</td>\n",
       "      <td>https://m.liepin.com/job/1916670241.shtml</td>\n",
       "      <td>5-7k·12薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>python</td>\n",
       "      <td>https://m.liepin.com/job/1926919427.shtml</td>\n",
       "      <td>10-15k·12薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>量化策略分析师</td>\n",
       "      <td>https://m.liepin.com/job/1925083443.shtml</td>\n",
       "      <td>25-35k·12薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>python数据开发工程师</td>\n",
       "      <td>https://m.liepin.com/job/1925012965.shtml</td>\n",
       "      <td>10-15k·12薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>python数据工程师</td>\n",
       "      <td>https://m.liepin.com/a/19449311.shtml</td>\n",
       "      <td>25-35k·12薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>算法科学家</td>\n",
       "      <td>https://m.liepin.com/a/19215715.shtml</td>\n",
       "      <td>65-70k·16薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>nlp语音人工智能工程师（外包）</td>\n",
       "      <td>https://m.liepin.com/a/19004681.shtml</td>\n",
       "      <td>30-54k·13薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>中高级python开发</td>\n",
       "      <td>https://m.liepin.com/a/18847711.shtml</td>\n",
       "      <td>20-40k·13薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>SPBU-数据分析师（风控）</td>\n",
       "      <td>https://m.liepin.com/job/1922750183.shtml</td>\n",
       "      <td>15-30k·16薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>python数据工程师</td>\n",
       "      <td>https://m.liepin.com/job/1926169045.shtml</td>\n",
       "      <td>10-15k·12薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>python开发工程师</td>\n",
       "      <td>https://m.liepin.com/job/1925786431.shtml</td>\n",
       "      <td>15-22k·13薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>数据挖掘工程师 (MJ002691)</td>\n",
       "      <td>https://m.liepin.com/job/1926206729.shtml</td>\n",
       "      <td>20-40k·13薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>算法工程师（电商安全业务）</td>\n",
       "      <td>https://m.liepin.com/job/1926146353.shtml</td>\n",
       "      <td>20-40k·12薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>机器学习专家</td>\n",
       "      <td>https://m.liepin.com/job/1925780721.shtml</td>\n",
       "      <td>面议</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>python培训师</td>\n",
       "      <td>https://m.liepin.com/job/1925142097.shtml</td>\n",
       "      <td>10-15k·12薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Python developer-ETL</td>\n",
       "      <td>https://m.liepin.com/job/1924665117.shtml</td>\n",
       "      <td>12-16k·14薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>视觉算法工程师</td>\n",
       "      <td>https://m.liepin.com/job/1923016243.shtml</td>\n",
       "      <td>20-30k·12薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>量化基金投资经理（总监/副总监/高级经理/经理/助理）</td>\n",
       "      <td>https://m.liepin.com/job/1923774079.shtml</td>\n",
       "      <td>12-35k·13薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>Python数据分析及机器学习讲师</td>\n",
       "      <td>https://m.liepin.com/job/1916848263.shtml</td>\n",
       "      <td>20-30k·12薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>中级Python开发工程师</td>\n",
       "      <td>https://m.liepin.com/job/1916900193.shtml</td>\n",
       "      <td>面议</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>业务数据分析师</td>\n",
       "      <td>https://m.liepin.com/job/1918822599.shtml</td>\n",
       "      <td>7-9k·12薪</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>python开发工程师</td>\n",
       "      <td>https://m.liepin.com/job/1916062237.shtml</td>\n",
       "      <td>面议</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             职称  \\\n",
       "0                                       数据分析助理    \n",
       "1    Python Architect/Developer   Python架构师/开发    \n",
       "2                                      数据分析工程师    \n",
       "3                                   AI工程师（实习岗）    \n",
       "4                                       python    \n",
       "5                                      量化策略分析师    \n",
       "6                                python数据开发工程师    \n",
       "7                                  python数据工程师    \n",
       "8                                        算法科学家    \n",
       "9                             nlp语音人工智能工程师（外包）    \n",
       "10                                 中高级python开发    \n",
       "11                              SPBU-数据分析师（风控）    \n",
       "12                                 python数据工程师    \n",
       "13                                 python开发工程师    \n",
       "14                          数据挖掘工程师 (MJ002691)    \n",
       "15                               算法工程师（电商安全业务）    \n",
       "16                                      机器学习专家    \n",
       "17                                   python培训师    \n",
       "18                        Python developer-ETL    \n",
       "19                                     视觉算法工程师    \n",
       "20                 量化基金投资经理（总监/副总监/高级经理/经理/助理）    \n",
       "21                           Python数据分析及机器学习讲师    \n",
       "22                               中级Python开发工程师    \n",
       "23                                     业务数据分析师    \n",
       "24                                 python开发工程师    \n",
       "\n",
       "                                           链接          薪水  \n",
       "0   https://m.liepin.com/job/1926863845.shtml    5-7k·12薪  \n",
       "1   https://m.liepin.com/job/1926257725.shtml  26-40k·12薪  \n",
       "2   https://m.liepin.com/job/1923135935.shtml   8-15k·13薪  \n",
       "3   https://m.liepin.com/job/1916670241.shtml    5-7k·12薪  \n",
       "4   https://m.liepin.com/job/1926919427.shtml  10-15k·12薪  \n",
       "5   https://m.liepin.com/job/1925083443.shtml  25-35k·12薪  \n",
       "6   https://m.liepin.com/job/1925012965.shtml  10-15k·12薪  \n",
       "7       https://m.liepin.com/a/19449311.shtml  25-35k·12薪  \n",
       "8       https://m.liepin.com/a/19215715.shtml  65-70k·16薪  \n",
       "9       https://m.liepin.com/a/19004681.shtml  30-54k·13薪  \n",
       "10      https://m.liepin.com/a/18847711.shtml  20-40k·13薪  \n",
       "11  https://m.liepin.com/job/1922750183.shtml  15-30k·16薪  \n",
       "12  https://m.liepin.com/job/1926169045.shtml  10-15k·12薪  \n",
       "13  https://m.liepin.com/job/1925786431.shtml  15-22k·13薪  \n",
       "14  https://m.liepin.com/job/1926206729.shtml  20-40k·13薪  \n",
       "15  https://m.liepin.com/job/1926146353.shtml  20-40k·12薪  \n",
       "16  https://m.liepin.com/job/1925780721.shtml          面议  \n",
       "17  https://m.liepin.com/job/1925142097.shtml  10-15k·12薪  \n",
       "18  https://m.liepin.com/job/1924665117.shtml  12-16k·14薪  \n",
       "19  https://m.liepin.com/job/1923016243.shtml  20-30k·12薪  \n",
       "20  https://m.liepin.com/job/1923774079.shtml  12-35k·13薪  \n",
       "21  https://m.liepin.com/job/1916848263.shtml  20-30k·12薪  \n",
       "22  https://m.liepin.com/job/1916900193.shtml          面议  \n",
       "23  https://m.liepin.com/job/1918822599.shtml    7-9k·12薪  \n",
       "24  https://m.liepin.com/job/1916062237.shtml          面议  "
      ]
     },
     "execution_count": 157,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# C-3\n",
    "# Easy: '职称' (job title), '链接' (link), '薪水' (salary)\n",
    "# Evaluate each XPath once and reuse the result for both printing and the DataFrame\n",
    "titles = r.html.xpath('//li[@class=\"flexbox\"]/a/span/text()')\n",
    "links = r.html.xpath('//li[@class=\"flexbox\"]/a/@href')\n",
    "salaries = r.html.xpath('//li[@class=\"flexbox\"]/span/text()')\n",
    "print(titles)\n",
    "print(links)\n",
    "print(salaries)\n",
    "# The three lists must have the same length, otherwise pd.DataFrame raises a ValueError\n",
    "df = pd.DataFrame( {\n",
    "         \"职称\": titles,\n",
    "         \"链接\": links,\n",
    "         \"薪水\": salaries,\n",
    "     } )\n",
    "df.to_excel(\"liepin_c-3.xlsx\", sheet_name=\"搜查结果\")\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['广州', '广州', '广州-天河区', '广州-天河区', '广州', '广州-天河区', '广州-黄埔区', '北京,广州,上海', '上海,深圳,广州', '广州,深圳', '深圳,广州', '广州', '广州-黄埔区', '广州', '广州-番禺区', '广州', '广州-天河区', '广州-黄埔区', '广州-越秀区', '广州-越秀区', '广州-珠江新城', '广州-天河区', '广州-天河区', '广州-天河区', '广州']\n",
      "['深圳市智灵时代科技有限公司', '天津恒程科技有限公司', '广东蔚海数问大数据科技有限公司', '赫基(中国)集团股份有限公司', '广州君思网络科技有限公司', '珠海钧誉资产管理有限公司', '广东南芯医疗科技有限公司', '某游戏公司', '某电子行业海外上市公司', '某知名外企', '新城科技', '酷狗音乐', '南芯智造(广州)医疗器械有限公司', '亚美信息科技', '欢聚集团', '唯品会(中国)', '中邮消费金融有限公司', '沈阳东软睿道教育服务有限公司', '友邦资讯科技', '恒安嘉新', '广州天汇资本管理有限公司', '北京传智播客教育科技有限公司', '玄武科技', '迪奥广州', '广州新博庭网络信息科技股份有限公司']\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>公司地点</th>\n",
       "      <th>公司名称</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>广州</td>\n",
       "      <td>深圳市智灵时代科技有限公司</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>广州</td>\n",
       "      <td>天津恒程科技有限公司</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>广东蔚海数问大数据科技有限公司</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>赫基(中国)集团股份有限公司</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>广州</td>\n",
       "      <td>广州君思网络科技有限公司</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>珠海钧誉资产管理有限公司</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>广州-黄埔区</td>\n",
       "      <td>广东南芯医疗科技有限公司</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>北京,广州,上海</td>\n",
       "      <td>某游戏公司</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>上海,深圳,广州</td>\n",
       "      <td>某电子行业海外上市公司</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>广州,深圳</td>\n",
       "      <td>某知名外企</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>深圳,广州</td>\n",
       "      <td>新城科技</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>广州</td>\n",
       "      <td>酷狗音乐</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>广州-黄埔区</td>\n",
       "      <td>南芯智造(广州)医疗器械有限公司</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>广州</td>\n",
       "      <td>亚美信息科技</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>广州-番禺区</td>\n",
       "      <td>欢聚集团</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>广州</td>\n",
       "      <td>唯品会(中国)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>中邮消费金融有限公司</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>广州-黄埔区</td>\n",
       "      <td>沈阳东软睿道教育服务有限公司</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>广州-越秀区</td>\n",
       "      <td>友邦资讯科技</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>广州-越秀区</td>\n",
       "      <td>恒安嘉新</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>广州-珠江新城</td>\n",
       "      <td>广州天汇资本管理有限公司</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>北京传智播客教育科技有限公司</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>玄武科技</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>迪奥广州</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>广州</td>\n",
       "      <td>广州新博庭网络信息科技股份有限公司</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        公司地点               公司名称\n",
       "0         广州      深圳市智灵时代科技有限公司\n",
       "1         广州         天津恒程科技有限公司\n",
       "2     广州-天河区    广东蔚海数问大数据科技有限公司\n",
       "3     广州-天河区     赫基(中国)集团股份有限公司\n",
       "4         广州       广州君思网络科技有限公司\n",
       "5     广州-天河区       珠海钧誉资产管理有限公司\n",
       "6     广州-黄埔区       广东南芯医疗科技有限公司\n",
       "7   北京,广州,上海              某游戏公司\n",
       "8   上海,深圳,广州        某电子行业海外上市公司\n",
       "9      广州,深圳              某知名外企\n",
       "10     深圳,广州               新城科技\n",
       "11        广州               酷狗音乐\n",
       "12    广州-黄埔区   南芯智造(广州)医疗器械有限公司\n",
       "13        广州             亚美信息科技\n",
       "14    广州-番禺区               欢聚集团\n",
       "15        广州            唯品会(中国)\n",
       "16    广州-天河区         中邮消费金融有限公司\n",
       "17    广州-黄埔区     沈阳东软睿道教育服务有限公司\n",
       "18    广州-越秀区             友邦资讯科技\n",
       "19    广州-越秀区               恒安嘉新\n",
       "20   广州-珠江新城       广州天汇资本管理有限公司\n",
       "21    广州-天河区     北京传智播客教育科技有限公司\n",
       "22    广州-天河区               玄武科技\n",
       "23    广州-天河区               迪奥广州\n",
       "24        广州  广州新博庭网络信息科技股份有限公司"
      ]
     },
     "execution_count": 156,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# C-4\n",
    "# Medium: '公司地点' (company location), '公司名称' (company name)\n",
    "\n",
    "# Within each right-info block, li[2] holds the company link and li[3] the location link\n",
    "locations = r.html.xpath('//dd[@class=\"right-info\"]/ul/li[3]/a/text()')\n",
    "companies = r.html.xpath('//dd[@class=\"right-info\"]/ul/li[2]/a/text()')\n",
    "print(locations)\n",
    "print(companies)\n",
    "\n",
    "数据 = pd.DataFrame( {\n",
    "         \"公司地点\": locations,\n",
    "         \"公司名称\": companies,\n",
    "     } )\n",
    "数据.to_excel(\"liepin_c-4.xlsx\", sheet_name=\"搜查结果\")\n",
    "数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'javascript:;', 'javascript:;', 'javascript:;', 'javascript:;', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/', 'https://m.liepin.com/gz/']\n",
      "['12小时前', '3小时前', '7小时前', '6小时前', '昨天', '昨天', '2020-03-19', '6小时前', '5小时前', '13小时前', '2020-03-21', '2020-03-18', '2020-03-06', '2020-03-02', '一个月前', '一个月前', '一个月前', '一个月前', '一个月前', '一个月前', '一个月前', '一个月前', '一个月前', '一个月前', '一个月前']\n",
      "['经验不限 学历不限', '5年以上 学历不限', '2年以上 本科及以上', '经验不限 硕士及以上', '经验不限 大专及以上', '3年以上 统招本科', '2年以上 本科及以上', '3年以上 学历不限', '1年以上 本科及以上', '3年以上 统招本科', '3年以上 学历不限', '2年以上 统招本科', '2年以上 本科及以上', '3年以上 本科及以上', '3年以上 本科及以上', '3年以上 统招本科', '5年以上 硕士及以上', '3年以上 统招本科', '5年以上 统招本科', '3年以上 本科及以上', '3年以上 本科及以上', '5年以上 大专及以上', '2年以上 统招本科', '1年以上 本科及以上', '1年以上 大专及以上']\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>公司URL</th>\n",
       "      <th>时间</th>\n",
       "      <th>经验</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>12小时前</td>\n",
       "      <td>经验不限 学历不限</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>3小时前</td>\n",
       "      <td>5年以上 学历不限</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>7小时前</td>\n",
       "      <td>2年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>6小时前</td>\n",
       "      <td>经验不限 硕士及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>昨天</td>\n",
       "      <td>经验不限 大专及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>昨天</td>\n",
       "      <td>3年以上 统招本科</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>2020-03-19</td>\n",
       "      <td>2年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>javascript:;</td>\n",
       "      <td>6小时前</td>\n",
       "      <td>3年以上 学历不限</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>javascript:;</td>\n",
       "      <td>5小时前</td>\n",
       "      <td>1年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>javascript:;</td>\n",
       "      <td>13小时前</td>\n",
       "      <td>3年以上 统招本科</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>javascript:;</td>\n",
       "      <td>2020-03-21</td>\n",
       "      <td>3年以上 学历不限</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>2020-03-18</td>\n",
       "      <td>2年以上 统招本科</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>2020-03-06</td>\n",
       "      <td>2年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>2020-03-02</td>\n",
       "      <td>3年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>3年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>3年以上 统招本科</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>5年以上 硕士及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>3年以上 统招本科</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>5年以上 统招本科</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>3年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>3年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>5年以上 大专及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>2年以上 统招本科</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>1年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>1年以上 大专及以上</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                       公司URL          时间          经验\n",
       "0   https://m.liepin.com/gz/       12小时前   经验不限 学历不限\n",
       "1   https://m.liepin.com/gz/        3小时前   5年以上 学历不限\n",
       "2   https://m.liepin.com/gz/        7小时前  2年以上 本科及以上\n",
       "3   https://m.liepin.com/gz/        6小时前  经验不限 硕士及以上\n",
       "4   https://m.liepin.com/gz/          昨天  经验不限 大专及以上\n",
       "5   https://m.liepin.com/gz/          昨天   3年以上 统招本科\n",
       "6   https://m.liepin.com/gz/  2020-03-19  2年以上 本科及以上\n",
       "7               javascript:;        6小时前   3年以上 学历不限\n",
       "8               javascript:;        5小时前  1年以上 本科及以上\n",
       "9               javascript:;       13小时前   3年以上 统招本科\n",
       "10              javascript:;  2020-03-21   3年以上 学历不限\n",
       "11  https://m.liepin.com/gz/  2020-03-18   2年以上 统招本科\n",
       "12  https://m.liepin.com/gz/  2020-03-06  2年以上 本科及以上\n",
       "13  https://m.liepin.com/gz/  2020-03-02  3年以上 本科及以上\n",
       "14  https://m.liepin.com/gz/        一个月前  3年以上 本科及以上\n",
       "15  https://m.liepin.com/gz/        一个月前   3年以上 统招本科\n",
       "16  https://m.liepin.com/gz/        一个月前  5年以上 硕士及以上\n",
       "17  https://m.liepin.com/gz/        一个月前   3年以上 统招本科\n",
       "18  https://m.liepin.com/gz/        一个月前   5年以上 统招本科\n",
       "19  https://m.liepin.com/gz/        一个月前  3年以上 本科及以上\n",
       "20  https://m.liepin.com/gz/        一个月前  3年以上 本科及以上\n",
       "21  https://m.liepin.com/gz/        一个月前  5年以上 大专及以上\n",
       "22  https://m.liepin.com/gz/        一个月前   2年以上 统招本科\n",
       "23  https://m.liepin.com/gz/        一个月前  1年以上 本科及以上\n",
       "24  https://m.liepin.com/gz/        一个月前  1年以上 大专及以上"
      ]
     },
     "execution_count": 155,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# C-5\n",
    "# Hard: '公司URL', '时间', '经验'\n",
    "\n",
    "# All three fields come from the third <li> of each job's right-info block.\n",
    "# NOTE: li[3]/a is the same anchor that supplies '公司地点' in the combined table,\n",
    "# so this href appears to be the location page rather than a company page.\n",
    "urls = r.html.xpath('//dd[@class=\"right-info\"]/ul/li[3]/a/@href')\n",
    "times = r.html.xpath('//dd[@class=\"right-info\"]/ul/li[3]/time/text()')\n",
    "list1 = r.html.xpath('//dd[@class=\"right-info\"]/ul/li[3]/text()')\n",
    "# Drop whitespace-only text nodes and trim the remaining ones.\n",
    "result = [x.strip() for x in list1 if x.strip() != '']\n",
    "print(urls)\n",
    "print(times)\n",
    "print(result)\n",
    "\n",
    "数据 = pd.DataFrame({\n",
    "    \"公司URL\": urls,\n",
    "    \"时间\": times,\n",
    "    \"经验\": result,\n",
    "})\n",
    "\n",
    "# Export to Excel, then display the table as the cell output.\n",
    "数据.to_excel(\"liepin_c-5.xlsx\", sheet_name=\"搜查结果\")\n",
    "数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "[Reference](https://blog.csdn.net/weixin_43843287/article/details/85699838?depth_1-utm_source=distribute.pc_relevant.none-task&utm_source=distribute.pc_relevant.none-task)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 160,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>职称</th>\n",
       "      <th>链接</th>\n",
       "      <th>薪水</th>\n",
       "      <th>公司地点</th>\n",
       "      <th>公司名称</th>\n",
       "      <th>公司URL</th>\n",
       "      <th>时间</th>\n",
       "      <th>经验</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>数据分析助理</td>\n",
       "      <td>https://m.liepin.com/job/1926863845.shtml</td>\n",
       "      <td>5-7k·12薪</td>\n",
       "      <td>广州</td>\n",
       "      <td>深圳市智灵时代科技有限公司</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>12小时前</td>\n",
       "      <td>经验不限 学历不限</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Python Architect/Developer   Python架构师/开发</td>\n",
       "      <td>https://m.liepin.com/job/1926257725.shtml</td>\n",
       "      <td>26-40k·12薪</td>\n",
       "      <td>广州</td>\n",
       "      <td>天津恒程科技有限公司</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>3小时前</td>\n",
       "      <td>5年以上 学历不限</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>数据分析工程师</td>\n",
       "      <td>https://m.liepin.com/job/1923135935.shtml</td>\n",
       "      <td>8-15k·13薪</td>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>广东蔚海数问大数据科技有限公司</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>7小时前</td>\n",
       "      <td>2年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AI工程师（实习岗）</td>\n",
       "      <td>https://m.liepin.com/job/1916670241.shtml</td>\n",
       "      <td>5-7k·12薪</td>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>赫基(中国)集团股份有限公司</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>6小时前</td>\n",
       "      <td>经验不限 硕士及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>python</td>\n",
       "      <td>https://m.liepin.com/job/1926919427.shtml</td>\n",
       "      <td>10-15k·12薪</td>\n",
       "      <td>广州</td>\n",
       "      <td>广州君思网络科技有限公司</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>昨天</td>\n",
       "      <td>经验不限 大专及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>量化策略分析师</td>\n",
       "      <td>https://m.liepin.com/job/1925083443.shtml</td>\n",
       "      <td>25-35k·12薪</td>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>珠海钧誉资产管理有限公司</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>昨天</td>\n",
       "      <td>3年以上 统招本科</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>python数据开发工程师</td>\n",
       "      <td>https://m.liepin.com/job/1925012965.shtml</td>\n",
       "      <td>10-15k·12薪</td>\n",
       "      <td>广州-黄埔区</td>\n",
       "      <td>广东南芯医疗科技有限公司</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>2020-03-19</td>\n",
       "      <td>2年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>python数据工程师</td>\n",
       "      <td>https://m.liepin.com/a/19449311.shtml</td>\n",
       "      <td>25-35k·12薪</td>\n",
       "      <td>北京,广州,上海</td>\n",
       "      <td>某游戏公司</td>\n",
       "      <td>javascript:;</td>\n",
       "      <td>6小时前</td>\n",
       "      <td>3年以上 学历不限</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>算法科学家</td>\n",
       "      <td>https://m.liepin.com/a/19215715.shtml</td>\n",
       "      <td>65-70k·16薪</td>\n",
       "      <td>上海,深圳,广州</td>\n",
       "      <td>某电子行业海外上市公司</td>\n",
       "      <td>javascript:;</td>\n",
       "      <td>5小时前</td>\n",
       "      <td>1年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>nlp语音人工智能工程师（外包）</td>\n",
       "      <td>https://m.liepin.com/a/19004681.shtml</td>\n",
       "      <td>30-54k·13薪</td>\n",
       "      <td>广州,深圳</td>\n",
       "      <td>某知名外企</td>\n",
       "      <td>javascript:;</td>\n",
       "      <td>13小时前</td>\n",
       "      <td>3年以上 统招本科</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>中高级python开发</td>\n",
       "      <td>https://m.liepin.com/a/18847711.shtml</td>\n",
       "      <td>20-40k·13薪</td>\n",
       "      <td>深圳,广州</td>\n",
       "      <td>新城科技</td>\n",
       "      <td>javascript:;</td>\n",
       "      <td>2020-03-21</td>\n",
       "      <td>3年以上 学历不限</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>SPBU-数据分析师（风控）</td>\n",
       "      <td>https://m.liepin.com/job/1922750183.shtml</td>\n",
       "      <td>15-30k·16薪</td>\n",
       "      <td>广州</td>\n",
       "      <td>酷狗音乐</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>2020-03-18</td>\n",
       "      <td>2年以上 统招本科</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>python数据工程师</td>\n",
       "      <td>https://m.liepin.com/job/1926169045.shtml</td>\n",
       "      <td>10-15k·12薪</td>\n",
       "      <td>广州-黄埔区</td>\n",
       "      <td>南芯智造(广州)医疗器械有限公司</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>2020-03-06</td>\n",
       "      <td>2年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>python开发工程师</td>\n",
       "      <td>https://m.liepin.com/job/1925786431.shtml</td>\n",
       "      <td>15-22k·13薪</td>\n",
       "      <td>广州</td>\n",
       "      <td>亚美信息科技</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>2020-03-02</td>\n",
       "      <td>3年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>数据挖掘工程师 (MJ002691)</td>\n",
       "      <td>https://m.liepin.com/job/1926206729.shtml</td>\n",
       "      <td>20-40k·13薪</td>\n",
       "      <td>广州-番禺区</td>\n",
       "      <td>欢聚集团</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>3年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>算法工程师（电商安全业务）</td>\n",
       "      <td>https://m.liepin.com/job/1926146353.shtml</td>\n",
       "      <td>20-40k·12薪</td>\n",
       "      <td>广州</td>\n",
       "      <td>唯品会(中国)</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>3年以上 统招本科</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>机器学习专家</td>\n",
       "      <td>https://m.liepin.com/job/1925780721.shtml</td>\n",
       "      <td>面议</td>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>中邮消费金融有限公司</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>5年以上 硕士及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>python培训师</td>\n",
       "      <td>https://m.liepin.com/job/1925142097.shtml</td>\n",
       "      <td>10-15k·12薪</td>\n",
       "      <td>广州-黄埔区</td>\n",
       "      <td>沈阳东软睿道教育服务有限公司</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>3年以上 统招本科</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Python developer-ETL</td>\n",
       "      <td>https://m.liepin.com/job/1924665117.shtml</td>\n",
       "      <td>12-16k·14薪</td>\n",
       "      <td>广州-越秀区</td>\n",
       "      <td>友邦资讯科技</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>5年以上 统招本科</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>视觉算法工程师</td>\n",
       "      <td>https://m.liepin.com/job/1923016243.shtml</td>\n",
       "      <td>20-30k·12薪</td>\n",
       "      <td>广州-越秀区</td>\n",
       "      <td>恒安嘉新</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>3年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>量化基金投资经理（总监/副总监/高级经理/经理/助理）</td>\n",
       "      <td>https://m.liepin.com/job/1923774079.shtml</td>\n",
       "      <td>12-35k·13薪</td>\n",
       "      <td>广州-珠江新城</td>\n",
       "      <td>广州天汇资本管理有限公司</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>3年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>Python数据分析及机器学习讲师</td>\n",
       "      <td>https://m.liepin.com/job/1916848263.shtml</td>\n",
       "      <td>20-30k·12薪</td>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>北京传智播客教育科技有限公司</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>5年以上 大专及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>中级Python开发工程师</td>\n",
       "      <td>https://m.liepin.com/job/1916900193.shtml</td>\n",
       "      <td>面议</td>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>玄武科技</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>2年以上 统招本科</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>业务数据分析师</td>\n",
       "      <td>https://m.liepin.com/job/1918822599.shtml</td>\n",
       "      <td>7-9k·12薪</td>\n",
       "      <td>广州-天河区</td>\n",
       "      <td>迪奥广州</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>1年以上 本科及以上</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>python开发工程师</td>\n",
       "      <td>https://m.liepin.com/job/1916062237.shtml</td>\n",
       "      <td>面议</td>\n",
       "      <td>广州</td>\n",
       "      <td>广州新博庭网络信息科技股份有限公司</td>\n",
       "      <td>https://m.liepin.com/gz/</td>\n",
       "      <td>一个月前</td>\n",
       "      <td>1年以上 大专及以上</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             职称  \\\n",
       "0                                       数据分析助理    \n",
       "1    Python Architect/Developer   Python架构师/开发    \n",
       "2                                      数据分析工程师    \n",
       "3                                   AI工程师（实习岗）    \n",
       "4                                       python    \n",
       "5                                      量化策略分析师    \n",
       "6                                python数据开发工程师    \n",
       "7                                  python数据工程师    \n",
       "8                                        算法科学家    \n",
       "9                             nlp语音人工智能工程师（外包）    \n",
       "10                                 中高级python开发    \n",
       "11                              SPBU-数据分析师（风控）    \n",
       "12                                 python数据工程师    \n",
       "13                                 python开发工程师    \n",
       "14                          数据挖掘工程师 (MJ002691)    \n",
       "15                               算法工程师（电商安全业务）    \n",
       "16                                      机器学习专家    \n",
       "17                                   python培训师    \n",
       "18                        Python developer-ETL    \n",
       "19                                     视觉算法工程师    \n",
       "20                 量化基金投资经理（总监/副总监/高级经理/经理/助理）    \n",
       "21                           Python数据分析及机器学习讲师    \n",
       "22                               中级Python开发工程师    \n",
       "23                                     业务数据分析师    \n",
       "24                                 python开发工程师    \n",
       "\n",
       "                                           链接          薪水      公司地点  \\\n",
       "0   https://m.liepin.com/job/1926863845.shtml    5-7k·12薪        广州   \n",
       "1   https://m.liepin.com/job/1926257725.shtml  26-40k·12薪        广州   \n",
       "2   https://m.liepin.com/job/1923135935.shtml   8-15k·13薪    广州-天河区   \n",
       "3   https://m.liepin.com/job/1916670241.shtml    5-7k·12薪    广州-天河区   \n",
       "4   https://m.liepin.com/job/1926919427.shtml  10-15k·12薪        广州   \n",
       "5   https://m.liepin.com/job/1925083443.shtml  25-35k·12薪    广州-天河区   \n",
       "6   https://m.liepin.com/job/1925012965.shtml  10-15k·12薪    广州-黄埔区   \n",
       "7       https://m.liepin.com/a/19449311.shtml  25-35k·12薪  北京,广州,上海   \n",
       "8       https://m.liepin.com/a/19215715.shtml  65-70k·16薪  上海,深圳,广州   \n",
       "9       https://m.liepin.com/a/19004681.shtml  30-54k·13薪     广州,深圳   \n",
       "10      https://m.liepin.com/a/18847711.shtml  20-40k·13薪     深圳,广州   \n",
       "11  https://m.liepin.com/job/1922750183.shtml  15-30k·16薪        广州   \n",
       "12  https://m.liepin.com/job/1926169045.shtml  10-15k·12薪    广州-黄埔区   \n",
       "13  https://m.liepin.com/job/1925786431.shtml  15-22k·13薪        广州   \n",
       "14  https://m.liepin.com/job/1926206729.shtml  20-40k·13薪    广州-番禺区   \n",
       "15  https://m.liepin.com/job/1926146353.shtml  20-40k·12薪        广州   \n",
       "16  https://m.liepin.com/job/1925780721.shtml          面议    广州-天河区   \n",
       "17  https://m.liepin.com/job/1925142097.shtml  10-15k·12薪    广州-黄埔区   \n",
       "18  https://m.liepin.com/job/1924665117.shtml  12-16k·14薪    广州-越秀区   \n",
       "19  https://m.liepin.com/job/1923016243.shtml  20-30k·12薪    广州-越秀区   \n",
       "20  https://m.liepin.com/job/1923774079.shtml  12-35k·13薪   广州-珠江新城   \n",
       "21  https://m.liepin.com/job/1916848263.shtml  20-30k·12薪    广州-天河区   \n",
       "22  https://m.liepin.com/job/1916900193.shtml          面议    广州-天河区   \n",
       "23  https://m.liepin.com/job/1918822599.shtml    7-9k·12薪    广州-天河区   \n",
       "24  https://m.liepin.com/job/1916062237.shtml          面议        广州   \n",
       "\n",
       "                 公司名称                     公司URL          时间          经验  \n",
       "0       深圳市智灵时代科技有限公司  https://m.liepin.com/gz/       12小时前   经验不限 学历不限  \n",
       "1          天津恒程科技有限公司  https://m.liepin.com/gz/        3小时前   5年以上 学历不限  \n",
       "2     广东蔚海数问大数据科技有限公司  https://m.liepin.com/gz/        7小时前  2年以上 本科及以上  \n",
       "3      赫基(中国)集团股份有限公司  https://m.liepin.com/gz/        6小时前  经验不限 硕士及以上  \n",
       "4        广州君思网络科技有限公司  https://m.liepin.com/gz/          昨天  经验不限 大专及以上  \n",
       "5        珠海钧誉资产管理有限公司  https://m.liepin.com/gz/          昨天   3年以上 统招本科  \n",
       "6        广东南芯医疗科技有限公司  https://m.liepin.com/gz/  2020-03-19  2年以上 本科及以上  \n",
       "7               某游戏公司              javascript:;        6小时前   3年以上 学历不限  \n",
       "8         某电子行业海外上市公司              javascript:;        5小时前  1年以上 本科及以上  \n",
       "9               某知名外企              javascript:;       13小时前   3年以上 统招本科  \n",
       "10               新城科技              javascript:;  2020-03-21   3年以上 学历不限  \n",
       "11               酷狗音乐  https://m.liepin.com/gz/  2020-03-18   2年以上 统招本科  \n",
       "12   南芯智造(广州)医疗器械有限公司  https://m.liepin.com/gz/  2020-03-06  2年以上 本科及以上  \n",
       "13             亚美信息科技  https://m.liepin.com/gz/  2020-03-02  3年以上 本科及以上  \n",
       "14               欢聚集团  https://m.liepin.com/gz/        一个月前  3年以上 本科及以上  \n",
       "15            唯品会(中国)  https://m.liepin.com/gz/        一个月前   3年以上 统招本科  \n",
       "16         中邮消费金融有限公司  https://m.liepin.com/gz/        一个月前  5年以上 硕士及以上  \n",
       "17     沈阳东软睿道教育服务有限公司  https://m.liepin.com/gz/        一个月前   3年以上 统招本科  \n",
       "18             友邦资讯科技  https://m.liepin.com/gz/        一个月前   5年以上 统招本科  \n",
       "19               恒安嘉新  https://m.liepin.com/gz/        一个月前  3年以上 本科及以上  \n",
       "20       广州天汇资本管理有限公司  https://m.liepin.com/gz/        一个月前  3年以上 本科及以上  \n",
       "21     北京传智播客教育科技有限公司  https://m.liepin.com/gz/        一个月前  5年以上 大专及以上  \n",
       "22               玄武科技  https://m.liepin.com/gz/        一个月前   2年以上 统招本科  \n",
       "23               迪奥广州  https://m.liepin.com/gz/        一个月前  1年以上 本科及以上  \n",
       "24  广州新博庭网络信息科技股份有限公司  https://m.liepin.com/gz/        一个月前  1年以上 大专及以上  "
      ]
     },
     "execution_count": 160,
     "metadata": {},
     "output_type": "execute_result"
    }
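   ],
   "source": [
    "# Combine every extracted field into one table.\n",
    "# `result` (the cleaned '经验' list) is defined in the C-5 cell above.\n",
    "# Every column must have one entry per job; unequal lengths raise ValueError.\n",
    "数据 = pd.DataFrame({\n",
    "    \"职称\": r.html.xpath('//li[@class=\"flexbox\"]/a/span/text()'),\n",
    "    \"链接\": r.html.xpath('//li[@class=\"flexbox\"]/a/@href'),\n",
    "    \"薪水\": r.html.xpath('//li[@class=\"flexbox\"]/span/text()'),\n",
    "    \"公司地点\": r.html.xpath('//dd[@class=\"right-info\"]/ul/li[3]/a/text()'),\n",
    "    \"公司名称\": r.html.xpath('//dd[@class=\"right-info\"]/ul/li[2]/a/text()'),\n",
    "    # NOTE: same li[3]/a anchor as '公司地点', so this is the location link's href.\n",
    "    \"公司URL\": r.html.xpath('//dd[@class=\"right-info\"]/ul/li[3]/a/@href'),\n",
    "    \"时间\": r.html.xpath('//dd[@class=\"right-info\"]/ul/li[3]/time/text()'),\n",
    "    \"经验\": result,\n",
    "})\n",
    "\n",
    "# Export to Excel, then display the table as the cell output.\n",
    "数据.to_excel(\"liepin_pandas.xlsx\", sheet_name=\"搜查结果\")\n",
    "数据"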
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "749px",
    "left": "1125.609375px",
    "top": "110px",
    "width": "281.390625px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
